CN109918498B - Problem warehousing method and device - Google Patents

Problem warehousing method and device

Info

Publication number
CN109918498B
Authority
CN
China
Legal status
Active
Application number
CN201910038367.5A
Other languages
Chinese (zh)
Other versions
CN109918498A (en)
Inventor
张建威
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910038367.5A
Publication of CN109918498A
Application granted
Publication of CN109918498B
Legal status: Active


Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval; Database Structures and File Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a problem warehousing method and device in the technical field of artificial intelligence. The method comprises the following steps: converting a target question into a target sentence vector; acquiring the pre-computed cluster center point of each of N cluster categories; calculating the distance between the target sentence vector and each of the N cluster center points to obtain N distances; selecting the cluster category corresponding to the smallest of the N distances as the target cluster category; calculating the similarity between the target sentence vector and the sentence vectors of all questions in the target cluster category to obtain M similarities; and, if all M similarities are less than or equal to a preset similarity threshold, storing the target question in the knowledge base. The technical scheme provided by the embodiment addresses the high computational cost and low efficiency of the prior art, in which the similarity between the target question (i.e., the new question) and every question stored in the knowledge base must be calculated.

Description

Problem warehousing method and device
[ field of technology ]
The invention relates to the technical field of artificial intelligence, in particular to a problem warehousing method and device.
[ background Art ]
Before a new question is added to the knowledge base, it must be compared with all existing questions in the knowledge base to determine whether the knowledge base already contains a question that duplicates it. If a duplicate exists, the new question is not added to the knowledge base; if no duplicate exists, the new question is added to the knowledge base.
The problem with this approach is that the similarity between the target question (i.e., the new question) and every question stored in the knowledge base must be calculated, which incurs high computational cost and low efficiency.
[ invention ]
In view of the above, embodiments of the present invention provide a problem warehousing method and apparatus, which address the high computational cost and low efficiency of the prior art, in which the similarity between the target question (i.e., the new question) and every question stored in the knowledge base must be calculated.
In one aspect, an embodiment of the present invention provides a problem warehousing method, the method comprising: acquiring a target question; converting the target question into a target sentence vector; acquiring a pre-computed cluster center point for each of N cluster categories, wherein the N cluster categories and their cluster center points are obtained through the following steps: acquiring a plurality of questions from a knowledge base; converting each question into a sentence vector to obtain a plurality of sentence vectors; and clustering the sentence vectors to obtain N clustering results, the N clustering results comprising N cluster categories and the cluster center point of each of the N cluster categories, where N is a natural number greater than or equal to 2; calculating the distance between the target sentence vector and the cluster center point of each of the N cluster categories to obtain N distances; selecting the cluster category corresponding to the smallest of the N distances as the target cluster category; calculating the similarity between the target sentence vector and the sentence vector of each question in the target cluster category to obtain M similarities, where M is the number of questions in the target cluster category; comparing each of the M similarities with a preset similarity threshold; and, if all M similarities are less than or equal to the preset similarity threshold, storing the target question in the knowledge base.
Further, calculating the similarity between the target sentence vector and the sentence vectors of all questions in the target cluster category comprises: forming a first sentence vector set from the target sentence vector and the sentence vectors of all questions in the target cluster category; performing outlier calculation on the first sentence vector set; judging, according to the result of the outlier calculation, whether the target sentence vector is an outlier; and, if the target sentence vector is not an outlier, calculating the similarity between the target sentence vector and the sentence vector of each question in the target cluster category.
Further, the method further comprises: if at least one of the M similarities is greater than the preset similarity threshold, screening out the questions whose similarity is greater than the preset similarity threshold; taking the screened-out questions as similar questions of the target question; and outputting prompt information for prompting the user that similar questions exist in the knowledge base, the prompt information carrying the similar questions of the target question.
Further, taking the screened-out question as a similar question of the target question comprises: forming a second sentence vector set from the sentence vectors of all questions in the target cluster category; performing outlier calculation on the second sentence vector set; judging, according to the result of the outlier calculation, whether the screened-out question is an outlier; and, if the screened-out question is not an outlier, taking the screened-out question as a similar question of the target question.
Further, clustering the plurality of sentence vectors to obtain N clustering results comprises: S1, determining the value of N according to prior experience, where N is the number of clusters; S2, randomly selecting N sentence vectors as the cluster center points of N cluster categories; S3, for a first sentence vector, calculating the distance between the first sentence vector and each of the N cluster center points, and assigning the first sentence vector to the category of the cluster center point closest to it, where the first sentence vector is any one of the remaining L-N sentence vectors and L is the total number of sentence vectors; and S4, after all sentence vectors have been assigned, recalculating and updating the cluster center points of the N categories from the sentence vectors in each category, and repeating S3 and S4 until, for every one of the N categories, the distance between two successive cluster center points is within a preset distance.
Further, calculating the distance between the target sentence vector and a cluster center point comprises: calculating the similarity between the target sentence vector and the cluster center point according to the cosine similarity formula S = (sum from i=1 to n of A_i × B_i) / (sqrt(sum from i=1 to n of A_i²) × sqrt(sum from i=1 to n of B_i²)), where S represents the similarity between the target sentence vector and the cluster center point, A represents the target sentence vector, B represents the cluster center point, A_i represents the i-th element of the target sentence vector, B_i represents the i-th element of the cluster center point, and n represents the number of elements in the target sentence vector.
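A plain-Python sketch of this similarity calculation, assuming the formula is the standard cosine similarity over the n vector elements (the function name is illustrative, not from the patent):

```python
import math

def cosine_similarity(a, b):
    """S = (sum_i A_i * B_i) / (sqrt(sum_i A_i^2) * sqrt(sum_i B_i^2)),
    where n = len(a) = len(b) is the number of vector elements."""
    if len(a) != len(b):
        raise ValueError("A and B must contain the same number of elements n")
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)
```

Note that under this formula a larger S means a smaller angular distance, so the "smallest distance" of the screening step corresponds to the largest similarity value.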
In one aspect, an embodiment of the present invention provides a problem warehousing apparatus, comprising: a first acquisition unit, configured to acquire a target question; a conversion unit, configured to convert the target question into a target sentence vector; a second acquisition unit, configured to acquire a pre-computed cluster center point for each of N cluster categories, wherein the N cluster categories and their cluster center points are obtained through the following steps: acquiring a plurality of questions from a knowledge base; converting each question into a sentence vector to obtain a plurality of sentence vectors; and clustering the sentence vectors to obtain N clustering results, the N clustering results comprising N cluster categories and the cluster center point of each of the N cluster categories, where N is a natural number greater than or equal to 2; a first calculation unit, configured to calculate the distance between the target sentence vector and the cluster center point of each of the N cluster categories to obtain N distances; a first screening unit, configured to select the cluster category corresponding to the smallest of the N distances as the target cluster category; a second calculation unit, configured to calculate the similarity between the target sentence vector and the sentence vector of each question in the target cluster category to obtain M similarities, where M is the number of questions in the target cluster category; a comparison unit, configured to compare each of the M similarities with a preset similarity threshold; and a storage unit, configured to store the target question in the knowledge base if all M similarities are less than or equal to the preset similarity threshold.
Further, the second calculation unit includes: a first determining subunit, configured to form a first sentence vector set from the target sentence vector and the sentence vectors of all questions in the target cluster category; a first calculating subunit, configured to perform outlier calculation on the first sentence vector set; a first judging subunit, configured to judge, according to the result of the outlier calculation, whether the target sentence vector is an outlier; and a second calculating subunit, configured to calculate, if the target sentence vector is not an outlier, the similarity between the target sentence vector and the sentence vector of each question in the target cluster category.
Further, the apparatus further comprises: a second screening unit, configured to screen out, if at least one of the M similarities is greater than the preset similarity threshold, the questions whose similarity is greater than the preset similarity threshold; a determining unit, configured to take the screened-out questions as similar questions of the target question; and an output unit, configured to output prompt information for prompting the user that similar questions exist in the knowledge base, the prompt information carrying the similar questions of the target question.
Further, the determining unit includes: a second determining subunit, configured to form a second sentence vector set from the sentence vectors of all questions in the target cluster category; a third calculating subunit, configured to perform outlier calculation on the second sentence vector set; a second judging subunit, configured to judge, according to the result of the outlier calculation, whether the screened-out question is an outlier; and a third determining subunit, configured to take the screened-out question as a similar question of the target question if the screened-out question is not an outlier.
Further, clustering the plurality of sentence vectors to obtain N clustering results comprises: S1, determining the value of N according to prior experience, where N is the number of clusters; S2, randomly selecting N sentence vectors as the cluster center points of N cluster categories; S3, for a first sentence vector, calculating the distance between the first sentence vector and each of the N cluster center points, and assigning the first sentence vector to the category of the cluster center point closest to it, where the first sentence vector is any one of the remaining L-N sentence vectors and L is the total number of sentence vectors; and S4, after all sentence vectors have been assigned, recalculating and updating the cluster center points of the N categories from the sentence vectors in each category, and repeating S3 and S4 until, for every one of the N categories, the distance between two successive cluster center points is within a preset distance.
Further, the first calculation unit is configured to: calculate the similarity between the target sentence vector and the cluster center point according to the cosine similarity formula S = (sum from i=1 to n of A_i × B_i) / (sqrt(sum from i=1 to n of A_i²) × sqrt(sum from i=1 to n of B_i²)), where S represents the similarity between the target sentence vector and the cluster center point, A represents the target sentence vector, B represents the cluster center point, A_i represents the i-th element of the target sentence vector, B_i represents the i-th element of the cluster center point, and n represents the number of elements in the target sentence vector.
In one aspect, an embodiment of the present invention provides a storage medium comprising a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the problem warehousing method described above.
In one aspect, an embodiment of the present invention provides a computer device comprising a memory and a processor, the memory being configured to store information including program instructions and the processor being configured to control the execution of the program instructions, wherein the program instructions, when loaded and executed by the processor, implement the steps of the problem warehousing method described above.
In the embodiment of the invention, the existing questions in the knowledge base are divided into N cluster categories. The target question is converted into a target sentence vector; the pre-computed cluster center point of each of the N cluster categories is acquired; the distance between the target sentence vector and each of the N cluster center points is calculated to obtain N distances; the cluster category corresponding to the smallest of the N distances is selected as the target cluster category; the similarity between the target sentence vector and the sentence vector of each question in the target cluster category is calculated to obtain M similarities, where M is the number of questions in the target cluster category; and each of the M similarities is compared with a preset similarity threshold. If all M similarities are less than or equal to the preset similarity threshold, no question in the knowledge base duplicates the target question, and the target question is stored in the knowledge base. Because the target question is compared only with the questions in the target cluster category rather than with every question in the knowledge base, the computational cost is greatly reduced and the calculation efficiency is improved, solving the problems of high computational cost and low efficiency in the prior art.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of an alternative problem warehousing method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of an alternative problem warehousing apparatus according to an embodiment of the invention;
FIG. 3 is a schematic diagram of an alternative computer device provided by an embodiment of the present invention.
[ detailed description ]
For a better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.
It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: A exists alone, A and B both exist, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
FIG. 1 is a flow chart of an alternative problem warehousing method according to an embodiment of the invention, as shown in FIG. 1, the method comprising:
step S102, obtaining a target problem.
Step S104, converting the target question into a target sentence vector.
Step S106, obtaining a pre-calculated cluster center point of each of N cluster categories, wherein the N cluster categories and the cluster center point of each of the N cluster categories are obtained through the following steps: acquiring a plurality of questions in a knowledge base; respectively converting the problems into sentence vectors to obtain a plurality of sentence vectors; clustering the sentence vectors to obtain N clustering results, wherein the N clustering results comprise N clustering categories and clustering center points of each clustering category in the N clustering categories, and N is a natural number greater than or equal to 2.
Step S108, calculating the distance between the target sentence vector and the cluster center point of each cluster category in the N cluster categories to obtain N distances.
Step S110, screening out the cluster category corresponding to the smallest distance in the N distances to obtain the target cluster category.
Step S112, calculating the similarity between the target sentence vector and the sentence vector of each question in the target cluster category to obtain M similarities, where M is the number of questions in the target cluster category.
In step S114, the M similarities are compared with the preset similarity threshold.
In step S116, if all M similarities are less than or equal to the preset similarity threshold, the target question is stored in the knowledge base.
The value of N may be determined according to prior experience.
Here, prior experience refers to the existing classification of all questions in the knowledge base, so the value of N can be determined from that classification.
The preset similarity threshold may be set according to actual requirements, for example, to 80%, 85%, 90%, or 95%.
The problem warehousing method provided by the embodiment of the invention is illustrated by an example. Suppose N = 20. A target question is acquired and converted into a target sentence vector. The distance between the target sentence vector and the cluster center point of each of the 20 cluster categories is calculated, yielding 20 distances. Suppose the smallest distance corresponds to the cluster center point of the 5th cluster category; the 5th cluster category is then taken as the target cluster category. Suppose the 5th cluster category contains 35 questions in total (M = 35). The similarity between the target sentence vector and the sentence vector of each of the 35 questions is calculated, yielding 35 similarities. Suppose the preset similarity threshold is 90%. Each of the 35 similarities is compared with the 90% threshold; if all 35 similarities are less than or equal to 90%, it is considered that no question in the knowledge base duplicates the target question, and the target question is stored in the knowledge base.
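The flow of this example (steps S102-S116) can be sketched as follows. The function and variable names are illustrative assumptions, not the patent's implementation, and "smallest distance" is taken here as "largest cosine similarity", consistent with the similarity-based distance described in this document:

```python
import math

def cosine(a, b):
    # Similarity used both as the "distance" to a cluster center point and for
    # the M per-question comparisons (an assumption of this sketch).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def warehouse(target_vec, centers, clusters, threshold):
    """centers: the N pre-computed cluster center points; clusters: the sentence
    vectors of each category's questions. Returns True when the target question
    should be stored in the knowledge base."""
    # S108/S110: pick the target cluster category by nearest center point
    # (smallest distance = largest cosine similarity).
    target_cat = max(range(len(centers)),
                     key=lambda i: cosine(target_vec, centers[i]))
    # S112: M similarities against every question in the target cluster category.
    sims = [cosine(target_vec, v) for v in clusters[target_cat]]
    # S114/S116: store only if no similarity exceeds the preset threshold.
    return all(s <= threshold for s in sims)
```

In the worked example above, only the M = 35 questions of the 5th cluster category are compared against the target vector, instead of every question in the knowledge base.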
It should be noted that the cluster center point of each of the N cluster categories is calculated only once, in advance; when the distance between a target sentence vector and each of the N cluster center points is calculated, these pre-computed center points are used directly.
Clustering the sentence vectors to obtain N clustering results comprises: S1, determining the value of N according to prior experience, where N is the number of clusters; S2, randomly selecting N sentence vectors as the cluster center points of N cluster categories; S3, for a first sentence vector, calculating the distance between the first sentence vector and each of the N cluster center points, and assigning the first sentence vector to the category of the cluster center point closest to it, where the first sentence vector is any one of the remaining L-N sentence vectors and L is the total number of sentence vectors; and S4, after all sentence vectors have been assigned, recalculating and updating the cluster center points of the N categories from the sentence vectors in each category, and repeating S3 and S4 until, for every one of the N categories, the distance between two successive cluster center points is within a preset distance.
The clustering process of the clustering algorithm adopted in the embodiment of the invention is as follows. Suppose the number of sentence vectors to be clustered is L. First, N of the L sentence vectors are randomly selected as initial cluster centers. Each of the remaining L-N sentence vectors is then assigned, according to its similarity (distance) to each cluster center, to the cluster most similar to it (each cluster being represented by its cluster center). The cluster center of each new cluster is then recalculated, and this process is repeated until the standard measure function begins to converge; the mean square error is typically used as the standard measure function. The clustering algorithm has the following characteristics: each cluster is as compact as possible and the clusters are separated as much as possible, so that sentence vectors within the same cluster are highly similar, while sentence vectors in different clusters are less similar.
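A minimal sketch of steps S1-S4, assuming Euclidean distance and mean-based center updates (the patent leaves the exact distance and update rule unspecified, and the function name and `tol` parameter are illustrative):

```python
import math
import random

def kmeans(vectors, n, tol=1e-4, seed=0):
    """Cluster L sentence vectors into n categories, iterating until every
    cluster center moves by at most `tol` between successive rounds (S4)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, n)  # S2: random initial cluster center points
    while True:
        # S3: assign every vector to the category of its nearest center point.
        groups = [[] for _ in range(n)]
        for v in vectors:
            i = min(range(n), key=lambda j: math.dist(v, centers[j]))
            groups[i].append(v)
        # S4: recompute each center as the mean of its category (keeping the
        # old center if a category happens to be empty).
        new_centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
        if all(math.dist(c, nc) <= tol for c, nc in zip(centers, new_centers)):
            return new_centers, groups
        centers = new_centers
```

The convergence check mirrors the patent's stopping rule: iteration ends once the distance between two successive center points of every category falls within a preset distance.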
In the embodiment of the invention, the existing questions in the knowledge base are divided into N cluster categories. The target question is converted into a target sentence vector; the pre-computed cluster center point of each of the N cluster categories is acquired; the distance between the target sentence vector and each of the N cluster center points is calculated to obtain N distances; the cluster category corresponding to the smallest of the N distances is selected as the target cluster category; the similarity between the target sentence vector and the sentence vector of each question in the target cluster category is calculated to obtain M similarities, where M is the number of questions in the target cluster category; and each of the M similarities is compared with a preset similarity threshold. If all M similarities are less than or equal to the preset similarity threshold, no question in the knowledge base duplicates the target question, and the target question is stored in the knowledge base. Because the target question is compared only with the questions in the target cluster category rather than with every question in the knowledge base, the computational cost is greatly reduced and the calculation efficiency is improved, solving the problems of high computational cost and low efficiency in the prior art.
Optionally, if at least one of the M similarities is greater than the preset similarity threshold, the questions whose similarity is greater than the preset similarity threshold are screened out; the screened-out questions are taken as similar questions of the target question; and prompt information is output, the prompt information being used to prompt the user that similar questions exist in the knowledge base and carrying the similar questions of the target question.
For example, suppose N = 20. A target question is acquired and converted into a target sentence vector; the distance between the target sentence vector and the cluster center point of each of the 20 cluster categories is calculated, yielding 20 distances. Suppose the smallest distance corresponds to the cluster center point of the 5th cluster category, which is taken as the target cluster category and contains 35 questions (M = 35). The similarity between the target sentence vector and the sentence vector of each of the 35 questions is calculated, yielding 35 similarities, and each is compared with a preset similarity threshold of 90%. If 2 of the 35 similarities are greater than 90%, the questions corresponding to those 2 similarities are screened out. Suppose the screened-out questions are Q1 and Q2; that is, the similarities between the sentence vectors of Q1 and Q2 and the target sentence vector are greater than 90%, so Q1 and Q2 may duplicate the target question. In this case, prompt information is output to prompt the user that similar questions exist in the knowledge base; the prompt information carries questions Q1 and Q2 and may read, for example: "Questions Q1 and Q2 in the knowledge base may be duplicates of the question you input."
When the questions already in the knowledge base are clustered, the clustering error may be large and some questions may be assigned to the wrong category. If at least one of the M similarities is greater than the preset similarity threshold, the questions whose similarity is greater than the preset similarity threshold are screened out; the sentence vectors of all questions in the target cluster category are formed into a second sentence vector set; outlier calculation is performed on the second sentence vector set; and whether each screened-out question is an outlier is judged according to the result of the outlier calculation. If a screened-out question is an outlier, it was likely misclassified: it should not have been assigned to the target cluster category, and it should not be treated as a similar question of the target question.
Optionally, the target sentence vector and the sentence vectors of all questions in the target cluster category are formed into a first sentence vector set; outlier calculation is performed on the first sentence vector set; and whether the target sentence vector is an outlier is judged according to the result of the outlier calculation. If the target sentence vector is not an outlier, the similarity between the target sentence vector and the sentence vector of each question in the target cluster category is calculated to obtain M similarities, where M is the number of questions in the target cluster category; each of the M similarities is compared with the preset similarity threshold; and, if all M similarities are less than or equal to the preset similarity threshold, the target question is stored in the knowledge base.
For example, suppose N = 20. A target question is acquired and converted into a target sentence vector; the distance between the target sentence vector and the cluster center point of each of the 20 cluster categories is calculated, yielding 20 distances. Suppose the smallest distance corresponds to the cluster center point of the 5th cluster category, which is taken as the target cluster category and contains 35 questions (M = 35). The target sentence vector and the sentence vectors of all questions in the 5th cluster category are formed into a first sentence vector set, which therefore covers 36 questions, and outlier calculation is performed on this set. Whether the target sentence vector is an outlier is judged according to the result of the outlier calculation. If the target sentence vector is an outlier, it does not follow the same pattern as the other sentence vectors in the first sentence vector set; this may be because the target question was assigned to the wrong category, i.e., it should not have been assigned to the 5th cluster category, in which case the target question needs to be reclassified. The distances between the target sentence vector and the cluster center points of the 20 cluster categories have already been calculated; arranging the 20 distances in ascending order, the smallest corresponds to the 5th cluster category and the next smallest, say, to the 12th cluster category. The target question may then be assigned to the 12th cluster category. Suppose the 12th cluster category contains 226 questions (M = 226).
The target sentence vector and the sentence vectors corresponding to all questions in the 12th cluster category then form a new first sentence vector set (containing 227 sentence vectors), and outlier calculation is performed on it. Whether the target sentence vector is an outlier is judged according to the result of the outlier calculation. If the target sentence vector is not an outlier, the similarities between the target sentence vector and the sentence vectors corresponding to the 226 questions in the target cluster category are calculated respectively to obtain 226 similarities, and the 226 similarities are compared with a preset similarity threshold of 90%. If all 226 similarities are less than or equal to the 90% threshold, it indicates that no question duplicating the target question exists in the knowledge base, and the target question is stored in the knowledge base.
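The warehousing decision walked through above can be sketched in Python. This is a minimal illustration rather than the patented implementation: `warehouse_question`, the cosine-based similarity, and the optional `is_outlier` callback are assumed names, and falling back to the next-closest category when the target vector is an outlier mirrors the reclassification step in the example.

```python
import numpy as np

def cosine_similarity(a, b):
    # S = sum(A_i * B_i) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def warehouse_question(target_vec, clusters, centers, threshold=0.9,
                       is_outlier=None):
    """Return True if the target question should be stored in the knowledge base.

    clusters:   one sequence of sentence vectors per cluster category.
    centers:    the cluster center point of each category, in the same order.
    is_outlier: optional callable(category_vectors, target_vec) -> bool.
    """
    # Rank the cluster categories from nearest to farthest; with normalized
    # vectors, the highest cosine similarity is the smallest distance.
    order = sorted(range(len(centers)),
                   key=lambda i: cosine_similarity(target_vec, centers[i]),
                   reverse=True)
    for idx in order:
        # If the target vector is an outlier in this category, the question
        # was likely misclassified: fall back to the next-closest category.
        if is_outlier is not None and is_outlier(clusters[idx], target_vec):
            continue
        sims = [cosine_similarity(target_vec, v) for v in clusters[idx]]
        # Store only if no existing question exceeds the similarity threshold.
        return all(s <= threshold for s in sims)
    return False
```

A question that is very close to an existing one (similarity above the threshold) is rejected; a question far from every member of its category is accepted.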
Outlier analysis is a method for mining data that differs from the majority of the data. Specifically, the isolation forest algorithm can be used for the outlier analysis. The isolation forest algorithm is based on a partitioning idea and consists of a large number of isolation trees; it is used for mining abnormal data, that is, outlier mining finds the data in a large data set that does not follow the pattern of the other data. For example, assume there are n question sentence vectors in a category; a plurality of isolation trees must first be trained and constructed. The method for training the t-th isolation tree is as follows: randomly extract m questions from the n questions as training samples of the t-th isolation tree; randomly select one sample from the m training samples and take its value as the value of the root node of a binary tree; binarily divide the m training samples according to the value of the root node, placing samples smaller than the value on the left of the node and samples greater than or equal to the value on the right of the node, which yields a splitting condition and the data sets on the left and right sides; and repeat the binary division process on the left and right data sets respectively until the data can no longer be subdivided, at which point the t-th isolation tree is determined, where t is a natural number greater than 2. The sentence vectors of the n questions are passed down the t-th isolation tree along the corresponding conditional branches until reaching a leaf node, and the number of edges that the sentence vector of the x-th question traverses from the root node to the leaf node, i.e. the path length h_t(x), is recorded; the average path length h(x) of the sentence vector of the x-th question is then determined from its path lengths over the plurality of isolation trees.
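The tree-training procedure just described can be sketched as follows. This is an illustrative simplification over one-dimensional values (real sentence vectors would also require choosing a random dimension per split); `build_itree` and `path_length` are hypothetical names.

```python
import random

def build_itree(data, depth=0, max_depth=20):
    """Recursively build one isolation tree over a list of scalar values."""
    if len(data) <= 1 or depth >= max_depth:
        return {"size": len(data)}           # leaf: data can no longer be subdivided
    split = random.choice(data)              # a random sample value is the split point
    left = [v for v in data if v < split]    # samples smaller than the value go left
    right = [v for v in data if v >= split]  # the rest go right
    if not left or not right:                # split failed to separate anything
        return {"size": len(data)}
    return {"split": split,
            "left": build_itree(left, depth + 1, max_depth),
            "right": build_itree(right, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    """h_t(x): number of edges from the root to the leaf reached by x."""
    if "split" not in node:
        return depth
    branch = node["left"] if x < node["split"] else node["right"]
    return path_length(x, branch, depth + 1)
```

Averaging `path_length` over many trees gives the h(x) used in the anomaly score: outliers are isolated near the root, so they tend to have short average paths.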
Whether a sentence vector is an outlier can be determined by calculating its anomaly score. The anomaly score of each sentence vector is calculated as s(x, m) = 2^(−h(x)/c(m)), where c(m) = 2H(m−1) − 2(m−1)/m and H(i) = ln(i) + ε, m is the number of training samples, and ε is the Euler constant, with a value of 0.5772156649. s(x, m) is the value of the anomaly score, and its value range is [0, 1]; the closer the anomaly score is to 1, the higher the likelihood that the sentence vector of the x-th question is an outlier.
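The anomaly score can be expressed directly in code. This is a sketch assuming the standard isolation forest normalization term, with `c` and `anomaly_score` as illustrative names.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler constant, as given in the text

def c(m):
    """Normalization term c(m) = 2*H(m-1) - 2*(m-1)/m, with H(i) = ln(i) + ε."""
    if m <= 1:
        return 0.0
    harmonic = math.log(m - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (m - 1) / m

def anomaly_score(h_x, m):
    """s(x, m) = 2 ** (-h(x) / c(m)); closer to 1 means more likely an outlier."""
    return 2.0 ** (-h_x / c(m))
```

By construction, an average path length equal to c(m) yields a score of exactly 0.5, and shorter paths (easier to isolate) yield scores closer to 1.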
The distance between the target sentence vector and the cluster center point can be measured through their similarity, calculated with the following formula: S = Σ(A_i × B_i) / (√Σ(A_i²) × √Σ(B_i²)), with i running from 1 to n, where S represents the similarity between the target sentence vector and the cluster center point, A represents the target sentence vector, B represents the cluster center point, A_i represents the i-th element of the target sentence vector, B_i represents the i-th element of the cluster center point, and n represents the number of elements contained in the target sentence vector.
The formula for calculating the similarity between the target sentence vector and the sentence vector corresponding to any problem in the target cluster category is similar to the above formula, and will not be described again.
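The similarity formula can be checked numerically with a short sketch (`cosine_similarity` is an illustrative name):

```python
import math

def cosine_similarity(a, b):
    # S = Σ(A_i × B_i) / (√Σ(A_i²) × √Σ(B_i²)), i = 1..n
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den
```

Identical directions give S = 1, orthogonal vectors give S = 0, and scaling a vector does not change the result, which is why this similarity can stand in for a distance between sentence vectors.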
An embodiment of the present invention provides a problem warehousing apparatus, which is configured to execute the problem warehousing method, and fig. 2 is a schematic diagram of an alternative problem warehousing apparatus according to an embodiment of the present invention, as shown in fig. 2, where the apparatus includes: the first acquisition unit 12, the conversion unit 14, the second acquisition unit 16, the first calculation unit 18, the first screening unit 20, the second calculation unit 22, the comparison unit 24, and the storage unit 26.
A first acquisition unit 12 for acquiring a target problem.
A conversion unit 14 for converting the target question into a target sentence vector.
A second obtaining unit 16, configured to obtain a cluster center point of each of the N pre-calculated cluster categories, where the N cluster categories and the cluster center point of each of the N cluster categories are obtained by: acquiring a plurality of questions in a knowledge base; respectively converting the problems into sentence vectors to obtain a plurality of sentence vectors; clustering the sentence vectors to obtain N clustering results, wherein the N clustering results comprise N clustering categories and clustering center points of each clustering category in the N clustering categories, and N is a natural number greater than or equal to 2.
The first calculating unit 18 is configured to calculate a distance between the target sentence vector and a cluster center point of each of the N cluster categories, to obtain N distances.
The first screening unit 20 is configured to screen out a cluster category corresponding to a smallest distance among the N distances, to obtain a target cluster category.
The second calculating unit 22 is configured to calculate similarities between the target sentence vector and sentence vectors corresponding to all questions in the target cluster category, and obtain M similarities, where M is the number of questions in the target cluster category.
And a comparing unit 24, configured to compare the M similarities with preset similarity thresholds, respectively.
And a storage unit 26, configured to store the target problem in the knowledge base if the M similarities are less than or equal to the preset similarity threshold.
In the embodiment of the invention, the existing problems in the knowledge base are divided into N cluster categories in advance. The target problem is converted into a target sentence vector; the pre-calculated cluster center point of each of the N cluster categories is acquired; the distance between the target sentence vector and the cluster center point of each of the N cluster categories is calculated to obtain N distances; the cluster category corresponding to the smallest of the N distances is screened out to obtain the target cluster category; the similarities between the target sentence vector and the sentence vectors corresponding to all problems in the target cluster category are calculated respectively to obtain M similarities, where M is the number of problems in the target cluster category; and the M similarities are compared with a preset similarity threshold respectively. If all M similarities are less than or equal to the preset similarity threshold, it indicates that no problem duplicating the target problem exists in the knowledge base, and the target problem is stored in the knowledge base. Because the target sentence vector only needs to be compared with the problems in the target cluster category, rather than with every problem in the knowledge base as in the prior art, the amount of calculation is greatly reduced and the calculation efficiency is improved, which solves the problem of high calculation cost when warehousing problems into a knowledge base.
Optionally, the second calculating unit 22 includes: a first determining subunit, a first calculating subunit, a first judging subunit, and a second calculating subunit. The first determining subunit is configured to form a first sentence vector set from the target sentence vector and the sentence vectors corresponding to all problems in the target cluster category. The first calculating subunit is configured to perform outlier calculation on the first sentence vector set. The first judging subunit is configured to judge whether the target sentence vector is an outlier according to the result of the outlier calculation. The second calculating subunit is configured to, if the target sentence vector is not an outlier, calculate the similarities between the target sentence vector and the sentence vectors corresponding to all problems in the target cluster category respectively.
Optionally, the apparatus further includes: a second screening unit, a determining unit, and an output unit. The second screening unit is configured to, if at least one of the M similarities is greater than the preset similarity threshold, screen out the problems corresponding to the similarities greater than the preset similarity threshold. The determining unit is configured to take the screened-out problems as similar problems of the target problem. The output unit is configured to output prompt information, where the prompt information is used to prompt the user that similar problems exist in the knowledge base, and the prompt information carries the similar problems of the target problem.
Optionally, the determining unit includes: a second determining subunit, a third calculating subunit, a second judging subunit, and a third determining subunit. The second determining subunit is configured to form a second sentence vector set from the sentence vectors corresponding to all problems in the target cluster category. The third calculating subunit is configured to perform outlier calculation on the second sentence vector set. The second judging subunit is configured to judge whether a screened-out problem is an outlier according to the result of the outlier calculation. The third determining subunit is configured to, if the screened-out problem is not an outlier, take the screened-out problem as a similar problem of the target problem.
Optionally, clustering the plurality of sentence vectors to obtain N clustering results, including: s1, determining an N value according to priori experience, wherein N is the cluster number of the clusters; s2, randomly selecting N sentence vectors as cluster center points of N cluster categories; s3, for the first sentence vector, calculating the distance between the first sentence vector and each cluster center point in the N cluster center points, classifying the first sentence vector into a category corresponding to the cluster center point closest to the first sentence vector, wherein the first sentence vector is any one of the remaining L-N sentence vectors, and L is the total number of the sentence vectors; s4, after all sentence vectors are classified, recalculating cluster center points of N categories according to the sentence vectors in each category, updating the cluster center points of the N categories, and circularly executing S3 and S4 until the distance between two adjacent cluster center points of each category in the N categories is within a preset distance.
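Steps S1 through S4 describe the classic k-means procedure; below is a compact sketch under that reading. The `kmeans` function name, the use of Euclidean distance, and the tolerance-based stopping rule are assumptions, since the text leaves these details open.

```python
import numpy as np

def kmeans(vectors, n_clusters, tol=1e-4, max_iter=100, seed=0):
    """K-means over sentence vectors, following steps S1-S4:
    random initial centers (S2), nearest-center assignment (S3),
    center recomputation (S4), repeated until the centers stop moving."""
    rng = np.random.default_rng(seed)
    vectors = np.asarray(vectors, dtype=float)
    # S2: randomly select n_clusters sentence vectors as initial centers
    centers = vectors[rng.choice(len(vectors), n_clusters, replace=False)]
    for _ in range(max_iter):
        # S3: assign each vector to the category of its nearest cluster center
        dists = np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # S4: recompute each center as the mean of the vectors in its category
        new_centers = np.array([
            vectors[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)])
        # Stop once successive centers are within the preset distance
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```

On two well-separated groups of vectors, the procedure assigns each group to its own category and places the centers at the group means.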
Optionally, the first calculating unit is configured to: calculate the similarity between the target sentence vector and the cluster center point according to the formula S = Σ(A_i × B_i) / (√Σ(A_i²) × √Σ(B_i²)), with i running from 1 to n, where S represents the similarity between the target sentence vector and the cluster center point, A represents the target sentence vector, B represents the cluster center point, A_i represents the i-th element of the target sentence vector, B_i represents the i-th element of the cluster center point, and n represents the number of elements contained in the target sentence vector.
In one aspect, an embodiment of the present invention provides a storage medium, where the storage medium includes a stored program, and when the program runs, it controls a device where the storage medium is located to execute the following steps: acquiring a target problem; converting the target problem into a target sentence vector; acquiring a pre-calculated cluster center point of each of N cluster categories, wherein the N cluster categories and the cluster center point of each of the N cluster categories are obtained through the following steps: acquiring a plurality of problems in a knowledge base; respectively converting the problems into sentence vectors to obtain a plurality of sentence vectors; clustering the sentence vectors to obtain N clustering results, wherein the N clustering results comprise N cluster categories and the cluster center point of each of the N cluster categories, and N is a natural number greater than or equal to 2; calculating the distance between the target sentence vector and the cluster center point of each of the N cluster categories to obtain N distances; screening out the cluster category corresponding to the smallest of the N distances to obtain a target cluster category; respectively calculating the similarity between the target sentence vector and the sentence vectors corresponding to all problems in the target cluster category to obtain M similarities, wherein M is the number of problems in the target cluster category; comparing the M similarities with a preset similarity threshold respectively; and if all M similarities are smaller than or equal to the preset similarity threshold, storing the target problem into the knowledge base.
Optionally, the device controlling the storage medium when the program runs further performs the following steps: forming a first sentence vector set by the target sentence vector and sentence vectors corresponding to all problems in the target cluster category; performing outlier calculation on the first sentence vector set; judging whether the target sentence vector is an outlier according to the result of the outlier calculation; if the target sentence vector is not an outlier, the similarity between the target sentence vector and the sentence vectors corresponding to all the problems in the target cluster class is calculated respectively.
Optionally, the device controlling the storage medium when the program runs further performs the following steps: if at least one of the M similarities is larger than the preset similarity threshold, screening out the problems corresponding to the similarities larger than the preset similarity threshold; taking the screened problems as similar problems of the target problem; and outputting prompt information, wherein the prompt information is used for prompting the user that similar problems exist in the knowledge base, and the prompt information carries the similar problems of the target problem.
Optionally, the device controlling the storage medium when the program runs further performs the following steps: sentence vectors corresponding to all problems in the target cluster category are formed into a second sentence vector set; performing outlier calculation on the second sentence vector set; judging whether the screened problem is an outlier according to the result of outlier calculation; if the screened question is not an outlier, the screened question is taken as a similar question of the target question.
Optionally, the device controlling the storage medium when the program runs further performs the following steps: s1, determining an N value according to priori experience, wherein N is the cluster number of the clusters; s2, randomly selecting N sentence vectors as cluster center points of N cluster categories; s3, for the first sentence vector, calculating the distance between the first sentence vector and each cluster center point in the N cluster center points, classifying the first sentence vector into a category corresponding to the cluster center point closest to the first sentence vector, wherein the first sentence vector is any one of the remaining L-N sentence vectors, and L is the total number of the sentence vectors; s4, after all sentence vectors are classified, recalculating cluster center points of N categories according to the sentence vectors in each category, updating the cluster center points of the N categories, and circularly executing S3 and S4 until the distance between two adjacent cluster center points of each category in the N categories is within a preset distance.
Optionally, the device controlling the storage medium when the program runs further performs the following steps: calculating the similarity between the target sentence vector and the cluster center point according to the formula S = Σ(A_i × B_i) / (√Σ(A_i²) × √Σ(B_i²)), with i running from 1 to n, wherein S represents the similarity between the target sentence vector and the cluster center point, A represents the target sentence vector, B represents the cluster center point, A_i represents the i-th element of the target sentence vector, B_i represents the i-th element of the cluster center point, and n represents the number of elements contained in the target sentence vector.
In one aspect, an embodiment of the present invention provides a computer device, including a memory for storing information including program instructions, and a processor for controlling the execution of the program instructions, the program instructions, when loaded and executed by the processor, implementing the following steps: acquiring a target problem; converting the target problem into a target sentence vector; acquiring a pre-calculated cluster center point of each of N cluster categories, wherein the N cluster categories and the cluster center point of each of the N cluster categories are obtained through the following steps: acquiring a plurality of problems in a knowledge base; respectively converting the problems into sentence vectors to obtain a plurality of sentence vectors; clustering the sentence vectors to obtain N clustering results, wherein the N clustering results comprise N cluster categories and the cluster center point of each of the N cluster categories, and N is a natural number greater than or equal to 2; calculating the distance between the target sentence vector and the cluster center point of each of the N cluster categories to obtain N distances; screening out the cluster category corresponding to the smallest of the N distances to obtain a target cluster category; respectively calculating the similarity between the target sentence vector and the sentence vectors corresponding to all problems in the target cluster category to obtain M similarities, wherein M is the number of problems in the target cluster category; comparing the M similarities with a preset similarity threshold respectively; and if all M similarities are smaller than or equal to the preset similarity threshold, storing the target problem into the knowledge base.
Optionally, the program instructions when loaded and executed by the processor further implement the steps of: forming a first sentence vector set by the target sentence vector and sentence vectors corresponding to all problems in the target cluster category; performing outlier calculation on the first sentence vector set; judging whether the target sentence vector is an outlier according to the result of the outlier calculation; if the target sentence vector is not an outlier, the similarity between the target sentence vector and the sentence vectors corresponding to all the problems in the target cluster class is calculated respectively.
Optionally, the program instructions when loaded and executed by the processor further implement the steps of: if at least one of the M similarities is larger than the preset similarity threshold, screening out the problems corresponding to the similarities larger than the preset similarity threshold; taking the screened problems as similar problems of the target problem; and outputting prompt information, wherein the prompt information is used for prompting the user that similar problems exist in the knowledge base, and the prompt information carries the similar problems of the target problem.
Optionally, the program instructions when loaded and executed by the processor further implement the steps of: sentence vectors corresponding to all problems in the target cluster category are formed into a second sentence vector set; performing outlier calculation on the second sentence vector set; judging whether the screened problem is an outlier according to the result of outlier calculation; if the screened question is not an outlier, the screened question is taken as a similar question of the target question.
Optionally, the program instructions when loaded and executed by the processor further implement the steps of: s1, determining an N value according to priori experience, wherein N is the cluster number of the clusters; s2, randomly selecting N sentence vectors as cluster center points of N cluster categories; s3, for the first sentence vector, calculating the distance between the first sentence vector and each cluster center point in the N cluster center points, classifying the first sentence vector into a category corresponding to the cluster center point closest to the first sentence vector, wherein the first sentence vector is any one of the remaining L-N sentence vectors, and L is the total number of the sentence vectors; s4, after all sentence vectors are classified, recalculating cluster center points of N categories according to the sentence vectors in each category, updating the cluster center points of the N categories, and circularly executing S3 and S4 until the distance between two adjacent cluster center points of each category in the N categories is within a preset distance.
Optionally, the program instructions when loaded and executed by the processor further implement the steps of: calculating the similarity between the target sentence vector and the cluster center point according to the formula S = Σ(A_i × B_i) / (√Σ(A_i²) × √Σ(B_i²)), with i running from 1 to n, wherein S represents the similarity between the target sentence vector and the cluster center point, A represents the target sentence vector, B represents the cluster center point, A_i represents the i-th element of the target sentence vector, B_i represents the i-th element of the cluster center point, and n represents the number of elements contained in the target sentence vector.
Fig. 3 is a schematic diagram of a computer device according to an embodiment of the present invention. As shown in fig. 3, the computer device 50 of this embodiment includes: a processor 51, a memory 52, and a computer program 53 stored in the memory 52 and executable on the processor 51. When executed by the processor 51, the computer program 53 implements the problem warehousing method in the embodiment; to avoid repetition, details are not described here again. Alternatively, when executed by the processor 51, the computer program implements the functions of each module/unit of the problem warehousing apparatus in the embodiment; to avoid repetition, details are not described here again.
The computer device 50 may be a desktop computer, a notebook computer, a palm top computer, a cloud server, or the like. Computer devices may include, but are not limited to, a processor 51, a memory 52. It will be appreciated by those skilled in the art that fig. 3 is merely an example of computer device 50 and is not intended to limit computer device 50, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., a computer device may also include an input-output device, a network access device, a bus, etc.
The processor 51 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 52 may be an internal storage unit of the computer device 50, such as a hard disk or memory of the computer device 50. The memory 52 may also be an external storage device of the computer device 50, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 50. Further, the memory 52 may also include both internal storage units and external storage devices of the computer device 50. The memory 52 is used to store computer programs and other programs and data required by the computer device. The memory 52 may also be used to temporarily store data that has been output or is to be output.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present invention, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the elements is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a Processor (Processor) to perform part of the steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing description is merely of preferred embodiments of the invention and is not intended to limit the invention; any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (7)

1. A problem warehousing method, the method comprising:
acquiring a target problem;
converting the target question into a target sentence vector;
acquiring a pre-calculated cluster center point of each of N cluster categories, wherein the N cluster categories and the cluster center point of each of the N cluster categories are obtained through the following steps: acquiring a plurality of questions in a knowledge base; respectively converting the problems into sentence vectors to obtain a plurality of sentence vectors; clustering the sentence vectors to obtain N clustering results, wherein the N clustering results comprise N clustering categories and clustering center points of each clustering category in the N clustering categories, and N is a natural number greater than or equal to 2;
calculating the distance between the target sentence vector and the cluster center point of each cluster category in the N cluster categories to obtain N distances;
screening out the cluster category corresponding to the smallest distance in the N distances to obtain a target cluster category;
respectively calculating the similarity between the target sentence vector and sentence vectors corresponding to all problems in the target cluster category to obtain M similarities, wherein M is the number of the problems in the target cluster category;
comparing the M similarities with a preset similarity threshold respectively;
if all the M similarities are smaller than or equal to the preset similarity threshold, storing the target problem into the knowledge base;
the calculating the similarity between the target sentence vector and the sentence vectors corresponding to all the problems in the target cluster category respectively includes:
forming a first sentence vector set by the target sentence vector and sentence vectors corresponding to all problems in the target cluster category;
performing outlier calculation on the first sentence vector set;
judging whether the target sentence vector is an outlier according to the result of outlier calculation;
if the target sentence vector is not an outlier, respectively calculating the similarity between the target sentence vector and sentence vectors corresponding to all problems in the target cluster category;
the clustering of the sentence vectors to obtain N clustering results includes:
s1, determining an N value according to priori experience, wherein N is the cluster number of the clusters;
s2, randomly selecting N sentence vectors as cluster center points of N cluster categories;
s3, for a first sentence vector, calculating the distance between the first sentence vector and each cluster center point in N cluster center points, and classifying the first sentence vector into a category corresponding to the cluster center point with the closest distance to the first sentence vector, wherein the first sentence vector is any one of the remaining L-N sentence vectors, and L is the total number of the sentence vectors;
S4, after all sentence vectors are classified, recalculating cluster center points of N categories according to the sentence vectors in each category, updating the cluster center points of the N categories,
and (3) circularly executing S3 and S4 until the distance between the adjacent cluster center points of each of the N categories is within a preset distance.
2. The method according to claim 1, wherein the method further comprises:
if at least one of the M similarities is larger than the preset similarity threshold, screening out a problem corresponding to the similarity larger than the preset similarity threshold;
taking the screened problems as similar problems of the target problems;
and outputting prompt information, wherein the prompt information is used for prompting a user that similar problems exist in the knowledge base, and the prompt information carries the similar problems of the target problems.
3. The method of any one of claims 1 to 2, wherein calculating a distance between the target sentence vector and the cluster center point comprises:
according to the formula

S = (Σ_{i=1}^{n} A_i·B_i) / (√(Σ_{i=1}^{n} A_i²) · √(Σ_{i=1}^{n} B_i²))

calculating the similarity between the target sentence vector and the cluster center point, wherein S represents the similarity between the target sentence vector and the cluster center point, A represents the target sentence vector, B represents the cluster center point, A_i represents the i-th element of the target sentence vector, B_i represents the i-th element of the cluster center point, and n represents the number of elements contained in the target sentence vector.
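Assuming the formula in claim 3 is the standard cosine similarity (consistent with the symbols A, B, A_i, B_i, and n defined there), a direct implementation is:

```python
import math

def cosine_similarity(a, b):
    """S = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2)))."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note that cosine similarity grows as vectors become more alike (1.0 for parallel vectors, 0.0 for orthogonal ones), so the claim's "smallest distance" step corresponds to the cluster center with the *largest* S under this formula.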
4. A problem warehousing apparatus, the apparatus comprising:
a first acquisition unit configured to acquire a target problem;
a conversion unit for converting the target question into a target sentence vector;
a second obtaining unit, configured to obtain a cluster center point of each of N pre-computed cluster categories, where the N cluster categories and the cluster center point of each of the N cluster categories are obtained by: acquiring a plurality of questions in a knowledge base; respectively converting the problems into sentence vectors to obtain a plurality of sentence vectors; clustering the sentence vectors to obtain N clustering results, wherein the N clustering results comprise N clustering categories and clustering center points of each clustering category in the N clustering categories, and N is a natural number greater than or equal to 2;
the first calculation unit is used for calculating the distance between the target sentence vector and the cluster center point of each cluster category in the N cluster categories to obtain N distances;
The first screening unit is used for screening the cluster category corresponding to the smallest distance in the N distances to obtain a target cluster category;
the second calculation unit is used for calculating the similarity between the target sentence vector and sentence vectors corresponding to all problems in the target cluster category respectively to obtain M similarity, wherein M is the number of the problems in the target cluster category;
the comparison unit is used for comparing the M similarity with a preset similarity threshold value respectively;
a storage unit configured to store the target problem in the knowledge base if the M similarities are all less than or equal to the preset similarity threshold,
wherein the second computing unit includes:
a first determining subunit, configured to form a first sentence vector set from the target sentence vector and sentence vectors corresponding to all questions in the target cluster category;
the first calculating subunit is used for calculating outliers of the first sentence vector set;
the first judging subunit is used for judging whether the target sentence vector is an outlier according to the result of outlier calculation;
the second calculating subunit is used for respectively calculating the similarity between the target sentence vector and the sentence vectors corresponding to all the problems in the target cluster category if the target sentence vector is not an outlier;
The process of clustering the sentence vectors to obtain N clustering results includes:
S1, determining the value of N according to prior experience, wherein N is the number of clusters;
S2, randomly selecting N sentence vectors as the cluster center points of N cluster categories;
S3, for a first sentence vector, calculating the distance between the first sentence vector and each of the N cluster center points, and classifying the first sentence vector into the category corresponding to the nearest cluster center point, wherein the first sentence vector is any one of the remaining L-N sentence vectors and L is the total number of sentence vectors;
S4, after all sentence vectors are classified, recalculating the cluster center point of each of the N categories from the sentence vectors in that category, and updating the cluster center points of the N categories accordingly;
and cyclically executing S3 and S4 until, for each of the N categories, the distance between the cluster center points obtained in successive iterations is within a preset distance, i.e., until the cluster centers converge.
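The outlier filtering performed by the first and second calculating subunits (form the set of the target vector plus the cluster's vectors, test whether the target is an outlier, and only then compute the M similarities) can be sketched with a simple distance-based rule. The mean-plus-k-standard-deviations criterion below is an illustrative assumption; the patent does not fix a particular outlier algorithm:

```python
import math

def is_outlier(target, cluster_vectors, k=2.0):
    """Flag the target vector as an outlier if its distance to the cluster
    centroid exceeds mean + k * stddev of the members' own distances."""
    dim = len(target)
    centroid = [sum(v[i] for v in cluster_vectors) / len(cluster_vectors)
                for i in range(dim)]
    dists = [math.dist(v, centroid) for v in cluster_vectors]
    mean = sum(dists) / len(dists)
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists))
    return math.dist(target, centroid) > mean + k * std
```

When the target is flagged as an outlier it is too far from every existing question in the category to have a meaningful duplicate, so the per-question similarity pass can be skipped and the target stored directly.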
5. The apparatus of claim 4, wherein the apparatus further comprises:
a second screening unit, configured to screen out a problem corresponding to a similarity greater than the preset similarity threshold if at least one of the M similarities is greater than the preset similarity threshold;
a determining unit, configured to take the screened-out problems as similar problems of the target problem;
and an output unit, configured to output prompt information, wherein the prompt information is used for prompting the user that similar problems exist in the knowledge base, and the prompt information carries the similar problems of the target problem.
6. A storage medium comprising a stored program, wherein the program, when run, controls a device in which the storage medium is located to perform the problem warehousing method of any one of claims 1 to 3.
7. A computer device comprising a memory for storing information including program instructions and a processor for controlling execution of the program instructions, characterized in that the program instructions, when loaded and executed by the processor, implement the steps of the problem warehousing method of any one of claims 1 to 3.
CN201910038367.5A 2019-01-16 2019-01-16 Problem warehousing method and device Active CN109918498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910038367.5A CN109918498B (en) 2019-01-16 2019-01-16 Problem warehousing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910038367.5A CN109918498B (en) 2019-01-16 2019-01-16 Problem warehousing method and device

Publications (2)

Publication Number Publication Date
CN109918498A CN109918498A (en) 2019-06-21
CN109918498B true CN109918498B (en) 2023-08-11

Family

ID=66960400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910038367.5A Active CN109918498B (en) 2019-01-16 2019-01-16 Problem warehousing method and device

Country Status (1)

Country Link
CN (1) CN109918498B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110442718B (en) * 2019-08-08 2023-12-08 腾讯科技(深圳)有限公司 Statement processing method and device, server and storage medium
CN110555101A (en) * 2019-09-09 2019-12-10 浙江诺诺网络科技有限公司 customer service knowledge base updating method, device, equipment and storage medium
CN112818127A (en) * 2019-11-15 2021-05-18 北京中关村科金技术有限公司 Method, device and medium for detecting corpus conflict in knowledge base
CN113127611B (en) * 2019-12-31 2024-05-14 北京中关村科金技术有限公司 Method, device and storage medium for processing question corpus
CN113761182A (en) * 2020-06-17 2021-12-07 北京沃东天骏信息技术有限公司 Method and device for determining service problem
CN111858891A (en) * 2020-07-23 2020-10-30 平安科技(深圳)有限公司 Question-answer library construction method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955965A (en) * 2016-06-21 2016-09-21 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN106547734A (en) * 2016-10-21 2017-03-29 上海智臻智能网络科技股份有限公司 A kind of question sentence information processing method and device
CN107562789A (en) * 2017-07-28 2018-01-09 深圳前海微众银行股份有限公司 Knowledge base problem update method, customer service robot and readable storage medium storing program for executing
CN109033156A (en) * 2018-06-13 2018-12-18 腾讯科技(深圳)有限公司 A kind of information processing method, device and terminal


Also Published As

Publication number Publication date
CN109918498A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109918498B (en) Problem warehousing method and device
CN109871886B (en) Abnormal point proportion optimization method and device based on spectral clustering and computer equipment
US20210382937A1 (en) Image processing method and apparatus, and storage medium
CN109829478B (en) Problem classification method and device based on variation self-encoder
CN110019876B (en) Data query method, electronic device and storage medium
CN109086654B (en) Handwriting model training method, text recognition method, device, equipment and medium
CN111783875A (en) Abnormal user detection method, device, equipment and medium based on cluster analysis
CN109739978A (en) A kind of Text Clustering Method, text cluster device and terminal device
CN110046634B (en) Interpretation method and device of clustering result
CN109726391B (en) Method, device and terminal for emotion classification of text
CN110135681A (en) Risk subscribers recognition methods, device, readable storage medium storing program for executing and terminal device
CN110969172A (en) Text classification method and related equipment
CN111444956A (en) Low-load information prediction method and device, computer system and readable storage medium
WO2020024444A1 (en) Group performance grade recognition method and apparatus, and storage medium and computer device
CN111159481B (en) Edge prediction method and device for graph data and terminal equipment
CN114332500A (en) Image processing model training method and device, computer equipment and storage medium
CN110457704B (en) Target field determination method and device, storage medium and electronic device
CN110019400B (en) Data storage method, electronic device and storage medium
CN111126501B (en) Image identification method, terminal equipment and storage medium
CN117294497A (en) Network traffic abnormality detection method and device, electronic equipment and storage medium
CN112131274A (en) Method, device and equipment for detecting time series abnormal points and readable storage medium
CN111783088A (en) Malicious code family clustering method and device and computer equipment
CN109657060B (en) Safety production accident case pushing method and system
CN111667018A (en) Object clustering method and device, computer readable medium and electronic equipment
CN116366603A (en) Method and device for determining active IPv6 address

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant