CN116910259A - Knowledge diagnosis method and device for knowledge base - Google Patents

Knowledge diagnosis method and device for knowledge base

Info

Publication number
CN116910259A
Authority
CN
China
Prior art keywords
text
knowledge
determining
characterization model
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311168291.0A
Other languages
Chinese (zh)
Other versions
CN116910259B (en)
Inventor
武文杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee
Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd filed Critical Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202311168291.0A priority Critical patent/CN116910259B/en
Publication of CN116910259A publication Critical patent/CN116910259A/en
Application granted granted Critical
Publication of CN116910259B publication Critical patent/CN116910259B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a knowledge diagnosis method and device for a knowledge base. The method comprises: determining at least 2 question texts associated with a knowledge point in the knowledge base; determining a distance parameter of a clustering algorithm based on a target language characterization model; and performing clustering calculation on each question text associated with the knowledge point, using the clustering algorithm with the determined distance parameter, to determine abnormal text among the question texts, wherein the distance between the abnormal text and all adjacent question texts is larger than the distance parameter. The application provides an efficient and complete knowledge diagnosis method that is loosely coupled to any specific business and restricts the detection range to a single knowledge point. Because the distance parameter of the clustering algorithm is determined using the target language characterization model, the tedious step of manually tuning parameters is avoided.

Description

Knowledge diagnosis method and device for knowledge base
Technical Field
The application relates to the technical field of intelligent decision making, in particular to a knowledge diagnosis method and device of a knowledge base.
Background
In knowledge engineering, a knowledge base is a structured, comprehensive and organized collection of knowledge that is easy to operate and use. It is a set of interconnected knowledge pieces that are stored, organized, managed and used in computer memory, using one or more knowledge representation schemes, to meet the need of solving problems in one or more fields.
Intelligent customer service for users is generally operated as a question-answering system based on a knowledge base. The questions and corresponding answers in such a system must be edited manually in advance. Because business scenarios overlap, each business scenario may be associated with several knowledge bases, and different business scenarios may be associated with the same knowledge base. As the business develops, the amount of knowledge gradually grows and data quality declines, so labeling errors and data conflicts become increasingly common.
The prior art lacks an efficient diagnosis method for knowledge bases. Abnormal knowledge data is screened and audited only manually, or based on data rules and confidence learning. Because knowledge base data is voluminous and complex, such screening is difficult and inefficient.
Disclosure of Invention
In view of the above, the embodiment of the application provides a knowledge diagnosis method and device for a knowledge base, which are used for solving the problems of large difficulty and low efficiency in screening abnormal knowledge data due to numerous and complicated knowledge base data in the prior art.
In a first aspect of an embodiment of the present application, there is provided a knowledge diagnosis method of a knowledge base, including:
determining at least 2 question texts associated with knowledge points in a knowledge base;
determining a distance parameter of a clustering algorithm based on the target language characterization model;
performing clustering calculation on each question text associated with the knowledge point, using the clustering algorithm with the determined distance parameter, to determine abnormal text among the question texts;
wherein the distance between the abnormal text and all adjacent question texts is larger than the distance parameter.
In a second aspect of the embodiment of the present application, there is provided a knowledge diagnosis apparatus for a knowledge base, including:
a question text determining module for determining at least 2 question texts associated with the knowledge points in the knowledge base;
the distance parameter determining module is used for determining distance parameters of the clustering algorithm based on the target language characterization model;
an abnormal text determining module for performing clustering calculation on each question text associated with the knowledge point, using the clustering algorithm with the determined distance parameter, to determine abnormal text among the question texts;
wherein the distance between the abnormal text and all adjacent question texts is larger than the distance parameter.
In a third aspect of the embodiments of the present application, there is provided an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the embodiments of the present application, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiment of the application has the following beneficial effects: the distance parameter of the clustering algorithm is determined based on the target language characterization model, and the clustering algorithm with the determined distance parameter is used to perform clustering calculation on each question text associated with the knowledge point, so as to determine abnormal text among the question texts. This provides an efficient and complete knowledge diagnosis method that is loosely coupled to any specific business and restricts the detection range to a single knowledge point. Because the distance parameter of the clustering algorithm is determined using the target language characterization model, the tedious step of manual parameter tuning is avoided.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a knowledge diagnosis method of a knowledge base according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a knowledge diagnosis apparatus for a knowledge base according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In knowledge engineering, a knowledge base is a structured, comprehensive and organized collection of knowledge that is easy to operate and use. It is a set of interconnected knowledge pieces that are stored, organized, managed and used in computer memory, using one or more knowledge representation schemes, to meet the need of solving problems in one or more fields.
Intelligent customer service for users is generally operated as a question-answering system based on a knowledge base. The question-answering system can lighten the burden on customer service staff, provide answers to most common questions, and improve user engagement and satisfaction. The questions and corresponding answers in such a system must be edited manually in advance. Because business scenarios overlap, each business scenario may be associated with several knowledge bases, and different business scenarios may be associated with the same knowledge base. As the business develops, the amount of knowledge gradually grows and data quality declines, so labeling errors and data conflicts become increasingly common.
In the prior art, abnormal knowledge data is screened and audited manually, or based on data rules or confidence learning. Data-rule-based methods place high demands on rule design, and the rule set keeps growing as the data increases, which makes maintenance difficult. Confidence-learning-based methods have two drawbacks: a. confidence learning needs to establish the joint probability distribution of the noisy labels and the true labels; the probability distributions of different businesses differ, yet different businesses may contain the same knowledge base, so the same knowledge may fall under different probability distributions; b. in actual business scenarios, the similar questions under one knowledge point do not necessarily all have the same meaning and are often organized by answer, which does not suit confidence learning. Both factors degrade the diagnostic effect. Manual auditing, for its part, is too difficult for a knowledge base with a large amount of complex data.
In summary, the prior art lacks an efficient diagnosis method for knowledge bases; because knowledge base data is voluminous and complex, knowledge diagnosis of a knowledge base is difficult and inefficient.
In view of the above problems in the prior art, the embodiment of the application provides a new knowledge diagnosis method for a knowledge base. The method determines the distance parameter of a clustering algorithm based on a target language characterization model, and uses the clustering algorithm with the determined distance parameter to perform clustering calculation on each question text associated with a knowledge point, so as to determine abnormal text among the question texts. This provides an efficient and complete knowledge diagnosis method that is loosely coupled to any specific business and restricts the detection range to a single knowledge point. Because the distance parameter of the clustering algorithm is determined using the target language characterization model, the tedious step of manually tuning parameters is avoided.
A knowledge diagnosis method and apparatus for a knowledge base according to an embodiment of the present application will be described in detail with reference to the accompanying drawings.
Fig. 1 is a schematic view of an application scenario according to an embodiment of the present application. The application scenario may include terminal devices 101, 102 and 103, server 104, network 105.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, and 103 are hardware, they may be various electronic devices having a display screen and supporting communication with the server 104, including but not limited to smartphones, tablets, laptop and desktop computers, etc.; when the terminal devices 101, 102, and 103 are software, they may be installed in the electronic device as above. Terminal devices 101, 102, and 103 may be implemented as multiple software or software modules, or as a single software or software module, as embodiments of the application are not limited in this regard. Further, various applications, such as a data processing application, an instant messaging tool, social platform software, a search class application, a shopping class application, and the like, may be installed on the terminal devices 101, 102, and 103.
The server 104 may be a server that provides various services, for example, a background server that receives a request transmitted from a terminal device with which communication connection is established, and the background server may perform processing such as receiving and analyzing the request transmitted from the terminal device and generate a processing result. The server 104 may be a server, a server cluster formed by a plurality of servers, or a cloud computing service center, which is not limited in this embodiment of the present application.
The server 104 may be hardware or software. When the server 104 is hardware, it may be various electronic devices that provide various services to the terminal devices 101, 102, and 103. When the server 104 is software, it may be a plurality of software or software modules providing various services to the terminal devices 101, 102, and 103, or may be a single software or software module providing various services to the terminal devices 101, 102, and 103, which is not limited in this embodiment of the present application.
The network 105 may be a wired network using coaxial cable, twisted pair and optical fiber connection, or may be a wireless network that can implement interconnection of various communication devices without wiring, for example, bluetooth (Bluetooth), near field communication (Near Field Communication, NFC), infrared (Infrared), etc., which are not limited by the embodiment of the present application.
The user can establish a communication connection with the server 104 via the network 105 through the terminal devices 101, 102, and 103 to receive or transmit information and the like. Specifically, the server 104 determines at least 2 question texts associated with a knowledge point in the knowledge base; the server 104 determines a distance parameter of the clustering algorithm based on the target language characterization model; and the server 104 performs clustering calculation on each question text associated with the knowledge point, using the clustering algorithm with the determined distance parameter, to determine abnormal text among the question texts. The distance between the abnormal text and all adjacent question texts is larger than the distance parameter.
It should be noted that the specific types, numbers and combinations of the terminal devices 101, 102 and 103, the server 104 and the network 105 may be adjusted according to the actual requirements of the application scenario, which is not limited in the embodiment of the present application.
Fig. 2 is a flow chart of a knowledge diagnosis method of a knowledge base according to an embodiment of the present application. The knowledge diagnosis method of the knowledge base of fig. 2 may be performed by the terminal device or the server of fig. 1. As shown in fig. 2, the knowledge diagnosis method of the knowledge base includes:
S201, determining at least 2 question texts associated with a knowledge point in a knowledge base;
S202, determining a distance parameter of a clustering algorithm based on a target language characterization model;
S203, performing clustering calculation on each question text associated with the knowledge point, using the clustering algorithm with the determined distance parameter, to determine abnormal text among the question texts, wherein the distance between the abnormal text and all adjacent question texts is larger than the distance parameter.
Specifically, the knowledge base in this embodiment is the knowledge base to be diagnosed. It may suffer from degraded data quality, labeling errors, and data conflicts, so the data in the knowledge base must be screened and the abnormal data picked out for processing. A knowledge point may be an answer or an intention in the knowledge base. In general, the data stored in the knowledge base takes the form of question-answer pairs. Because scenarios and services differ, each answer may have several similarly worded questions; these are called similar questions, and they are the question texts of this embodiment. Each knowledge point is associated with a plurality of question texts, and each question text belongs to only one knowledge point. After the knowledge base is put into use in an intelligent customer service system, a user submits a question; the system matches the question against the similar questions to determine the corresponding intention, retrieves the corresponding reply text according to that intention, and returns the reply text to the user as an instant message. In this way the intelligent customer service system acts as a customer service robot.
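The relationship described above (each knowledge point owns several question texts, each question text belongs to exactly one knowledge point, and step S201 requires at least 2 question texts) can be sketched in Python as follows. This is an illustrative sketch only: the function names and the `(question_text, knowledge_point_id)` record layout are assumptions and do not come from the application.

```python
from collections import defaultdict


def group_questions_by_knowledge_point(records):
    """Group question texts by knowledge point.

    `records` is an iterable of (question_text, knowledge_point_id) pairs.
    This record layout is an assumption made for the sketch.
    """
    owner = {}                  # question text -> the knowledge point it belongs to
    groups = defaultdict(list)  # knowledge point -> its question texts
    for question, point_id in records:
        # A question text may belong to only one knowledge point.
        if question in owner and owner[question] != point_id:
            raise ValueError(f"question mapped to two knowledge points: {question!r}")
        if question not in owner:
            owner[question] = point_id
            groups[point_id].append(question)
    return dict(groups)


def diagnosable_points(groups, min_questions=2):
    """Keep only knowledge points with at least `min_questions` question texts."""
    return {pid: qs for pid, qs in groups.items() if len(qs) >= min_questions}
```

For example, a knowledge point with a single question text would be dropped by `diagnosable_points`, since clustering within that knowledge point has nothing to compare against.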
However, in the prior art, to support personalization and extend the intelligence of the system, the system also allows users to define custom similar questions and the associations between those similar questions and intentions. As the data records in the knowledge base increase, there are more and more similar questions, and the mapping between similar questions and intentions becomes more and more complex. In addition, the data is labeled manually when the knowledge base is built, and because people understand wording differently, the mapping between a similar question and an intention is sometimes incorrect, or is defensible but still ambiguous. The knowledge base may therefore suffer from degraded data quality, labeling errors, and data conflicts.
To solve this problem in the prior art, a diagnostic mechanism needs to be introduced to screen the knowledge base for abnormal data or abnormal text. Therefore, the embodiment adopts a clustering algorithm to realize the screening of the abnormal text.
Specifically, the clustering algorithm in this embodiment may be the DBSCAN algorithm, a density-based spatial clustering algorithm that is widely used in machine learning and data mining. Its clustering principle is intuitive: the density inside each cluster is higher than the density around it, and the density of noise is lower than the density of any cluster. Noise corresponds to the abnormal text in this embodiment. The distance parameter of the clustering algorithm is critical. If it is too large, the question texts are not screened thoroughly or accurately, and the error is large. If it is too small, normal question texts are screened out, causing misjudgment. The distance parameter, also called the scan radius, is the maximum distance at which two samples are still considered neighbors: if the distance between two samples is less than or equal to the distance parameter, the two samples are neighbors of each other. That is, if the distance between two question texts is less than or equal to the distance parameter, the two question texts belong to the same cluster; if the distance between a question text and all adjacent question texts is greater than the distance parameter, that question text is an abnormal text. Selecting an appropriate distance parameter is difficult.
Thus, the present embodiment determines distance parameters of a clustering algorithm that diagnoses a knowledge base based on a target language characterization model. Specifically, the data in the knowledge base is adopted to carry out artificial intelligence training on the initial language characterization model so as to obtain the target language characterization model. The initial language characterization model may employ a BERT model that employs a neural network structure for constructing question-answering tasks or language reasoning tasks, which are not described in detail herein.
The training process of the initial language characterization model is as follows. Using the knowledge data in the knowledge base, the initial language characterization model is trained on a classification task and a similarity task to obtain a classification loss result and a similarity loss result, and the model parameters are updated by backpropagation according to these two loss results. Training the model with the data in the knowledge base as the training set shortens the distance between question texts with the same meaning and lengthens the distance between question texts with different meanings, so that the features of the question texts are maximized and receive more of the network's attention. This improves how well question texts can be distinguished during cluster analysis. The initial language characterization model is trained iteratively so that it gradually fits the data in the knowledge base; the trained model is then tested on a test set, and the best-performing model is taken as the target language characterization model. The target language characterization model fits the raw data in the knowledge base to the greatest extent; that is, it is fitted to all of the raw data in the knowledge base.
The target language characterization model is then tested on the test set, and the mean square error of the loss over all samples in the test set is calculated; this mean square error is the distance parameter. The mean square error is the average of the squared differences between the predicted values and the true values obtained when the test set is run through the target language characterization model. Specifically, given a true value y and a predicted value y', the mean square error is the average of $(y - y')^2$ over all samples. The mean square error evaluates the performance of the target language characterization model, and because that model best fits the raw data in the knowledge base, the mean square error also reflects the distance relationship between text categories in the knowledge base well. Using the mean square error as the distance parameter lets the clustering algorithm classify and screen the question texts more accurately.
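The mean square error described above can be sketched in a few lines of Python. The sketch assumes that the labels and the model outputs on the test set are available as equal-length lists of numbers; the application does not specify their exact form, and the function name is hypothetical.

```python
def mean_squared_error(y_true, y_pred):
    """Average of (y - y')^2 over all test samples.

    Assumes numeric labels (y_true) and numeric model outputs (y_pred);
    the exact form of both is not specified in the application.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# The resulting value would then serve as the distance parameter
# (scan radius) of the clustering algorithm.
eps = mean_squared_error([1.0, 0.0, 1.0, 0.0], [0.5, 0.5, 1.0, 0.0])
```

Here the squared differences are 0.25, 0.25, 0 and 0, so `eps` is 0.125.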
Further, clustering calculation is performed on each question text associated with the knowledge point, using the clustering algorithm with the determined distance parameter, to determine abnormal text among the question texts; the distance between the abnormal text and all adjacent question texts is larger than the distance parameter. Abnormal text is text in the knowledge base that is mislabeled or in conflict with other data. Specifically, the clustering algorithm calculates the distances between the text vectors of the question texts. If the distance between two adjacent question texts is within the distance parameter, the two question texts belong to the same cluster; if the distance between a question text and all adjacent question texts is larger than the distance parameter, that question text is an abnormal text.
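The screening rule in the preceding paragraph can be sketched as follows. This is a simplified illustration, not a full DBSCAN implementation: it flags a question text as abnormal when no other question text of the same knowledge point lies within the distance parameter, which roughly corresponds to DBSCAN's noise rule when the minimum neighborhood size is two (counting the point itself). The use of Euclidean distance and the function names are assumptions; the application does not state which distance metric is applied to the text vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_abnormal(vectors, eps):
    """Return indices of vectors that have no other vector within `eps`.

    `vectors` holds the text vectors of the question texts of ONE
    knowledge point; `eps` is the distance parameter (scan radius).
    """
    abnormal = []
    for i, v in enumerate(vectors):
        has_neighbor = any(
            euclidean(v, w) <= eps
            for j, w in enumerate(vectors)
            if j != i
        )
        if not has_neighbor:
            abnormal.append(i)
    return abnormal


# Three nearby question texts and one far away from all of them.
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
outliers = find_abnormal(vectors, eps=0.5)
```

Because the check runs only over the question texts of a single knowledge point, the detection range stays within that knowledge point, as the method intends.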
According to the technical scheme provided by the embodiment of the application, the distance parameter of the clustering algorithm is determined based on the target language characterization model, and the clustering algorithm with the determined distance parameter is used to perform clustering calculation on each question text associated with the knowledge point, so as to determine abnormal text among the question texts. This provides an efficient and complete knowledge diagnosis method that is loosely coupled to any specific business and restricts the detection range to a single knowledge point. Because the distance parameter of the clustering algorithm is determined using the target language characterization model, the tedious step of manually tuning parameters is avoided.
In some embodiments, further comprising:
training the initial language characterization model on a classification task according to knowledge data in the knowledge base to output a classification loss result;
training the initial language characterization model on a similarity task according to knowledge data in the knowledge base to output a similarity loss result;
superposing the classification loss result and the similarity loss result to determine a total loss result;
reversely adjusting model parameters of the initial language characterization model according to the total loss result and iteratively training the initial language characterization model;
testing the initial language characterization model by using the test set and outputting a first test result;
and when the first test result meets the preset requirement, determining the initial language characterization model of the last test as the target language characterization model.
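The superposition of the two losses in the steps above can be sketched for a single training sample as follows. The application does not specify the form of either loss, so this sketch assumes cross-entropy for the classification task and a margin-based contrastive loss for the similarity task; the function names and the `margin` parameter are likewise hypothetical.

```python
import math


def classification_loss(probs, label):
    """Cross-entropy for one sample (assumed loss form).

    `probs` is the predicted probability distribution over knowledge
    points; `label` is the index of the correct knowledge point.
    """
    return -math.log(probs[label])


def similarity_loss(distance, same_point, margin=1.0):
    """Contrastive loss for one pair of question texts (assumed loss form).

    Pairs from the same knowledge point are pulled together; other
    pairs are pushed apart until they are at least `margin` away.
    """
    if same_point:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def total_loss(probs, label, distance, same_point, margin=1.0):
    """Superpose (add) the classification loss and the similarity loss."""
    return classification_loss(probs, label) + similarity_loss(distance, same_point, margin)
```

In an actual training loop, the total loss would be backpropagated through the language characterization model by a deep learning framework; that part is omitted here.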
Specifically, this embodiment describes the training process of the initial language characterization model, which may also be regarded as fitting the initial language characterization model to the knowledge data in the knowledge base. Using the knowledge data in the knowledge base, the initial language characterization model is trained on a classification task and a similarity task to obtain a classification loss result and a similarity loss result, and the model parameters are updated by backpropagation according to these two loss results. Training the model with the data in the knowledge base as the training set shortens the distance between question texts with the same meaning and lengthens the distance between question texts with different meanings, so that the features of the question texts are maximized and receive more of the network's attention. This improves how well question texts can be distinguished during cluster analysis. The initial language characterization model is trained iteratively so that it gradually fits the data in the knowledge base; the trained model is then tested on a test set, and the best-performing model is taken as the target language characterization model. The target language characterization model fits the raw data in the knowledge base to the greatest extent; that is, it is fitted to all of the raw data in the knowledge base.
Further, the test set is data used to verify the effect of the model, and the first test result is used to evaluate how well the initial language characterization model has learned. The data in the test set is collected from the knowledge base, as is the data used in training, which makes the evaluation of the initial language characterization model more convincing. In general, the data in the knowledge base may be divided in a 4:1 ratio, where 4/5 can be used for training and 1/5 can be used for testing. The initial language characterization model is trained iteratively, tested on the test set after training, and a first test result is output. When the first test result meets the preset requirement, the initial language characterization model from the last test is determined to be the target language characterization model. The preset requirement may be an expected threshold on test set accuracy. Reaching this threshold indicates that the target language characterization model's ability to predict or judge the data in the knowledge base meets the preset requirement, and that the model fits the raw data in the knowledge base to the greatest extent.
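A 4:1 division of the knowledge base data can be sketched as below. The sketch assumes a shuffled split into two disjoint subsets with a fixed random seed for reproducibility; the application only states the ratio, so the shuffling, the seed, and the function name are assumptions.

```python
import random


def split_four_to_one(records, seed=0):
    """Split records into a training part (4/5) and a test part (1/5).

    Shuffling with a fixed seed is an assumption of this sketch; the
    application only specifies the 4:1 ratio.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 4 // 5  # index where the test part starts
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = split_four_to_one(range(10))
```

With ten records, this yields eight training records and two test records, and every record appears in exactly one of the two parts.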
In some embodiments, determining the distance parameter of the clustering algorithm based on the target language characterization model comprises:
determining labeling information of each sample in the test set;
and determining the distance parameter according to the second test result of the target language characterization model and the labeling information.
Specifically, the second test result is the result of testing the target language characterization model with the test set, that is, the result of the last test of the initial language characterization model with the test set; the first test results include the second test result. The labeling information defines the meaning of each sample in the test set, and the loss on the data in the knowledge base can essentially be determined from the second test result and the labeling information. Because the target language characterization model has been trained on the classification task and the similarity task and best fits the original data in the knowledge base, this loss also represents the category differences between the texts in the knowledge base. Determining the distance parameter of the clustering algorithm from these category differences makes the clustering algorithm more accurate.
In some embodiments, determining the distance parameter from the second test result and the labeling information of the target language characterization model includes:
calculating the mean square error of the second test results of all samples in the test set and the corresponding labeling information;
and determining a distance parameter according to the mean square error.
Specifically, the mean square error is the average of the squared differences between the predicted values and the true values obtained when the test set is passed through the target language characterization model. Assuming a true value y and a predicted value y', the mean square error is the average of (y - y')² over all samples in the test set. The mean square error evaluates the performance of the target language characterization model, and because the target language characterization model best fits the original data in the knowledge base, the mean square error also reflects well the distance relation between text categories in the knowledge base. Using the mean square error as the distance parameter allows the clustering algorithm to classify and screen the question texts more accurately.
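As a worked illustration of the calculation, the following sketch computes the mean square error over a small test set and uses it directly as the distance parameter. How the second test result and the labeling information are encoded as numbers is not stated in this passage, so the numeric values and function names below are assumptions.

```python
def mean_squared_error(predicted, labeled):
    """Average of (y - y')^2 over all samples in the test set."""
    if len(predicted) != len(labeled) or not predicted:
        raise ValueError("predicted and labeled must be non-empty and of equal length")
    return sum((y - y_pred) ** 2 for y_pred, y in zip(predicted, labeled)) / len(predicted)


def distance_parameter_from_test(predicted, labeled):
    """Use the mean square error itself as the clustering distance parameter."""
    return mean_squared_error(predicted, labeled)


# Hypothetical second test results and labeling information for three samples.
second_test_result = [0.9, 0.2, 0.7]
labeling_information = [1.0, 0.0, 1.0]
print(distance_parameter_from_test(second_test_result, labeling_information))
```

For these values the squared errors are 0.01, 0.04, and 0.09, so the mean square error is 0.14 / 3, roughly 0.0467.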
In some embodiments, performing clustering calculation on each question text associated with the knowledge points by using the clustering algorithm after the distance parameter is determined, so as to determine an abnormal text from the question texts, wherein the distance between the abnormal text and each adjacent question text is greater than the distance parameter, comprises:
determining whether each problem text has adjacent text within the distance parameter by using a clustering algorithm;
and when a question text has no adjacent text within the distance parameter, determining that question text to be an abnormal text.
Specifically, the distance parameter reflects the distance between two adjacent question texts: the distance between any two adjacent question texts is not greater than the distance parameter. Taking one question text as a reference, any other question text within the distance parameter is an adjacent text, and the adjacent text and the reference question text can be judged to belong to the same cluster. The distance parameter is thus the criterion for deciding whether question texts belong to the same cluster, and when a question text has no adjacent text within the distance parameter, that question text is determined to be an abnormal text. In that case the distance between the abnormal text and every cluster exceeds the distance parameter, so the corresponding text data is also relatively noisy.
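The neighbor check described above can be sketched as a simple pairwise search. This passage does not name the clustering algorithm, so the code below is only a stand-in that follows the stated rule; the use of Euclidean distance, the representation of each question text as an embedding vector, and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_abnormal_texts(embeddings, distance_parameter):
    """Return indices of question texts with no other text within the distance parameter."""
    abnormal = []
    for i, vec in enumerate(embeddings):
        has_neighbor = any(
            j != i and euclidean(vec, other) <= distance_parameter
            for j, other in enumerate(embeddings)
        )
        if not has_neighbor:
            abnormal.append(i)
    return abnormal


# Three question texts close together and one far away (hypothetical embeddings).
question_embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
print(find_abnormal_texts(question_embeddings, distance_parameter=0.5))
```

The pairwise loop is quadratic in the number of question texts, which is acceptable for the question texts of a single knowledge point; a larger set would call for a spatial index or a library implementation of density-based clustering.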
In some embodiments, further comprising:
determining the attribution range of the abnormal text;
and determining a processing mode of the abnormal text according to the attribution range.
Specifically, after the abnormal text is screened out, it should be processed so as to thoroughly resolve problems such as degraded data quality, mislabeling, and data conflicts. Different types of abnormal text call for different processing modes, so the attribution range of the abnormal text needs to be determined. Colloquially, the attribution range can be understood as whether the abnormal text is benign or malignant. Abnormal texts in different attribution ranges are handled with processing modes of different levels or weights.
In some embodiments, the attribution range includes within the white list and outside the white list, and determining a processing mode of the abnormal text according to the attribution range includes:
if the abnormal text is within the white list, reassigning a new label to the abnormal text in a searching and clustering mode;
and if the abnormal text is outside the white list, manually repairing and verifying the abnormal text.
Specifically, a white list mechanism may be incorporated for the text data in the knowledge base when the knowledge base is first built. Text data in the white list has been repeatedly verified and can be referenced and passed preferentially, which improves the safety and convenience of the knowledge base. If the abnormal text is within the white list, the abnormal text and other question texts correspond to the same answer but belong to different kinds of services, and the label of the abnormal text has not been corrected, so the abnormal text is screened out in the diagnosis process. For such abnormal text, retrieval can be performed again using the keywords of the abnormal text, the retrieved results are clustered, and the cluster to which the abnormal text belongs is determined, so that a new label is reassigned.
Further, if the abnormal text is outside the white list, the abnormal text is likely to have been caused by mislabeling. The abnormal text can be repaired manually by correcting its labeling information, and after repeated verification it is restored to a normal question text and can be added to the white list again.
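The choice between the two processing modes can be expressed as a small dispatch routine. The sketch below only routes each abnormal text to a mode; the retrieval, clustering, and manual repair steps themselves are not implemented, and the mode names and function names are hypothetical.

```python
def choose_processing_mode(abnormal_text, whitelist):
    """Pick a processing mode from the attribution range of an abnormal text."""
    if abnormal_text in whitelist:
        # Within the white list: reassign a label via retrieval and clustering.
        return "relabel_by_retrieval_and_clustering"
    # Outside the white list: likely mislabeled, so route to manual repair.
    return "manual_repair_and_verification"


def plan_processing(abnormal_texts, whitelist):
    """Group abnormal texts by the processing mode chosen for each."""
    plan = {
        "relabel_by_retrieval_and_clustering": [],
        "manual_repair_and_verification": [],
    }
    for text in abnormal_texts:
        plan[choose_processing_mode(text, whitelist)].append(text)
    return plan


# Hypothetical abnormal texts and white list.
print(plan_processing(["q_a", "q_b", "q_c"], whitelist={"q_b"}))
```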
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, and such combinations are not described in detail herein.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Fig. 3 is a schematic diagram of a knowledge diagnosis apparatus for a knowledge base according to an embodiment of the present application. As shown in fig. 3, the knowledge diagnosis apparatus of the knowledge base includes:
a question text determination module 301 configured to determine at least 2 question texts associated with knowledge points in a knowledge base;
a distance parameter determination module 302 configured to determine a distance parameter of the clustering algorithm based on the target language characterization model;
an abnormal text determining module 303, configured to perform clustering calculation on each question text associated with the knowledge points by using the clustering algorithm after the distance parameter is determined, so as to determine an abnormal text from the question texts;
wherein the distance between the abnormal text and each adjacent question text is greater than the distance parameter.
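The relationship between the three modules can be pictured as a simple composition in which each module is a pluggable callable. This is a structural sketch only: the class name, field names, and signatures are hypothetical, and the stub callables in the usage example stand in for the real modules 301, 302, and 303.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class KnowledgeDiagnosisApparatus:
    # Module 301: returns the question texts associated with a knowledge point.
    determine_question_texts: Callable[[str], List[str]]
    # Module 302: returns the distance parameter of the clustering algorithm.
    determine_distance_parameter: Callable[[], float]
    # Module 303: returns the abnormal texts among the question texts.
    determine_abnormal_texts: Callable[[List[str], float], List[str]]

    def diagnose(self, knowledge_point: str) -> List[str]:
        """Run the three modules in order for one knowledge point."""
        texts = self.determine_question_texts(knowledge_point)
        if len(texts) < 2:
            raise ValueError("at least 2 question texts are required")
        distance_parameter = self.determine_distance_parameter()
        return self.determine_abnormal_texts(texts, distance_parameter)


# Usage with stub modules (hypothetical data).
apparatus = KnowledgeDiagnosisApparatus(
    determine_question_texts=lambda kp: ["q1", "q2", "q3"],
    determine_distance_parameter=lambda: 0.5,
    determine_abnormal_texts=lambda texts, eps: [t for t in texts if t == "q3"],
)
print(apparatus.diagnose("refund policy"))
```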
In some embodiments, the distance parameter determination module 302 of fig. 3 is further configured for:
performing classification task training on the initial language characterization model according to knowledge data in the knowledge base to output a classification loss result;
performing similarity task training on the initial language characterization model according to knowledge data in the knowledge base to output a similarity loss result;
superposing the classification loss result and the similarity loss result to determine a total loss result;
reversely adjusting model parameters of the initial language characterization model according to the total loss result and iteratively training the initial language characterization model;
testing the initial language characterization model by using a test set and outputting a first test result;
and when the first test result meets the preset requirement, determining the initial language characterization model of the last test as the target language characterization model.
In some embodiments, the distance parameter determination module 302 of fig. 3 includes:
determining labeling information of each sample in the test set;
and determining the distance parameter according to the second test result of the target language characterization model and the labeling information.
In some embodiments, the distance parameter determination module 302 of fig. 3 includes:
calculating the mean square error of the second test results of all samples in the test set and the corresponding labeling information;
and determining a distance parameter according to the mean square error.
In some embodiments, the outlier text determination module 303 of fig. 3 includes:
determining whether each problem text has adjacent text within the distance parameter by using a clustering algorithm;
and when a question text has no adjacent text within the distance parameter, determining that question text to be an abnormal text.
In some embodiments, the outlier text determination module 303 of fig. 3 further comprises:
determining the attribution range of the abnormal text;
and determining a processing mode of the abnormal text according to the attribution range.
In some embodiments, the attribution range includes within the white list and outside the white list, and the abnormal text determination module 303 of fig. 3 includes:
if the abnormal text is within the white list, reassigning a new label to the abnormal text in a searching and clustering mode;
and if the abnormal text is outside the white list, manually repairing and verifying the abnormal text.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the order in which the processes are executed should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present application in any way.
Fig. 4 is a schematic diagram of an electronic device 4 according to an embodiment of the present application. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401, when executing the computer program 403, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, a processor 401 and a memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and does not limit it; the electronic device 4 may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 4. Memory 402 may also include both internal storage units and external storage devices of electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow of the methods of the above embodiments may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunication signals.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A knowledge diagnosis method of a knowledge base, the method comprising:
determining at least 2 question texts associated with knowledge points in a knowledge base;
determining a distance parameter of a clustering algorithm based on the target language characterization model;
performing clustering calculation on each question text associated with the knowledge points by using the clustering algorithm after the distance parameter is determined, so as to determine an abnormal text from the question texts;
wherein the distance between the abnormal text and each adjacent question text is greater than the distance parameter.
2. The method as recited in claim 1, further comprising:
performing classification task training on the initial language characterization model according to knowledge data in the knowledge base to output a classification loss result;
performing similarity task training on the initial language characterization model according to knowledge data in the knowledge base to output a similarity loss result;
superposing the classification loss result and the similarity loss result to determine a total loss result;
reversely adjusting model parameters of an initial language characterization model according to the total loss result and iteratively training the initial language characterization model;
testing the initial language characterization model by using a test set and outputting a first test result;
and when the first test result meets the preset requirement, determining the initial language characterization model of the last test as the target language characterization model.
3. The method of claim 2, wherein determining distance parameters of a clustering algorithm based on a target language characterization model comprises:
determining labeling information of each sample in the test set;
and determining the distance parameter according to the second test result of the target language characterization model and the labeling information.
4. The method of claim 3, wherein said determining the distance parameter from the second test result of the target language characterization model and the labeling information comprises:
calculating the mean square error of the second test results of all samples in the test set and the corresponding labeling information;
and determining the distance parameter according to the mean square error.
5. The method according to claim 1, wherein performing clustering calculation on each question text associated with the knowledge points by using the clustering algorithm after the distance parameter is determined, so as to determine an abnormal text from the question texts, the distance between the abnormal text and each adjacent question text being greater than the distance parameter, comprises:
determining whether each question text has adjacent text within the distance parameter by using the clustering algorithm;
and when a question text has no adjacent text within the distance parameter, determining that question text to be an abnormal text.
6. The method of any one of claims 1-5, further comprising:
determining the attribution range of the abnormal text;
and determining a processing mode of the abnormal text according to the attribution range.
7. The method of claim 6, wherein the attribution range includes within a white list and outside the white list, and wherein determining the processing mode of the abnormal text according to the attribution range comprises:
if the abnormal text is within the white list, reassigning a new label to the abnormal text in a searching and clustering mode;
and if the abnormal text is outside the white list, manually repairing and verifying the abnormal text.
8. A knowledge diagnosis apparatus of a knowledge base, comprising:
a question text determining module for determining at least 2 question texts associated with the knowledge points in the knowledge base;
the distance parameter determining module is used for determining distance parameters of the clustering algorithm based on the target language characterization model;
the abnormal text determining module is used for performing clustering calculation on each question text associated with the knowledge points by using the clustering algorithm after the distance parameter is determined, so as to determine an abnormal text from the question texts;
wherein the distance between the abnormal text and each adjacent question text is greater than the distance parameter.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
CN202311168291.0A 2023-09-12 2023-09-12 Knowledge diagnosis method and device for knowledge base Active CN116910259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311168291.0A CN116910259B (en) 2023-09-12 2023-09-12 Knowledge diagnosis method and device for knowledge base

Publications (2)

Publication Number Publication Date
CN116910259A true CN116910259A (en) 2023-10-20
CN116910259B CN116910259B (en) 2024-04-16

Family

ID=88353486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311168291.0A Active CN116910259B (en) 2023-09-12 2023-09-12 Knowledge diagnosis method and device for knowledge base

Country Status (1)

Country Link
CN (1) CN116910259B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813910A (en) * 2020-06-24 2020-10-23 平安科技(深圳)有限公司 Method, system, terminal device and computer storage medium for updating customer service problem
US20200349183A1 (en) * 2019-05-03 2020-11-05 Servicenow, Inc. Clustering and dynamic re-clustering of similar textual documents
CN113297291A (en) * 2021-05-08 2021-08-24 上海电气风电集团股份有限公司 Monitoring method, monitoring system, readable storage medium and wind driven generator
CN115935229A (en) * 2022-11-22 2023-04-07 歌尔股份有限公司 Product abnormity detection method, device, equipment and storage medium
CN116383724A (en) * 2023-02-16 2023-07-04 北京数美时代科技有限公司 Single-domain label vector extraction method and device, electronic equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
查鲁.C.阿加沃尔: "《数据挖掘 原理与实践 基础篇》", 北京:机械工业出版社, pages: 193 - 195 *

Also Published As

Publication number Publication date
CN116910259B (en) 2024-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant