WO2024000822A1 - Anomaly detection method, apparatus, device, and medium for text classification annotation samples - Google Patents

Anomaly detection method, apparatus, device, and medium for text classification annotation samples

Info

Publication number
WO2024000822A1
Authority
WO
WIPO (PCT)
Prior art keywords
annotation data
cluster
classification
text classification
classification annotation
Application number
PCT/CN2022/118488
Other languages
English (en)
French (fr)
Inventor
张健
王子豪
王子
唐家英
陈运文
纪达麒
Original Assignee
达而观信息科技(上海)有限公司
Application filed by 达而观信息科技(上海)有限公司
Publication of WO2024000822A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Definitions

  • This application relates to computer data processing technology, for example, to an anomaly detection method, device, equipment and medium for text classification and labeling samples.
  • Text classification is a common processing task in the field of machine learning.
  • Application scenarios include news classification, sentiment analysis, intent recognition, etc.
  • In actual text classification tasks, developers first need to annotate a certain number of samples for the classification labels required by the scenario, and then build the text classification service through model training. In this process, the quality of the annotated samples is closely related to the prediction accuracy of the text service: a sample set with high annotation quality yields better model performance, while a poorly annotated one leads to poor classification results.
  • Sample denoising is therefore an important step in the development of text classification applications.
  • When noisy samples are eliminated by judging whether a neural network has converged, normal samples may be removed as conflicting samples while erroneous samples are retained, which reduces data quality. If the recognition accuracy for noisy samples is too low, a great deal of manual work is introduced, and normal samples may be mistakenly filtered out while noisy samples are retained.
  • This application provides an anomaly detection method, device, equipment and medium for text classification annotation samples, so as to effectively detect anomalies on text classification annotation samples and reduce the labor cost of sample denoising.
  • This application provides an anomaly detection method for text classification annotation samples, including:
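  • obtaining a text classification annotation data set to be denoised, wherein each text classification annotation data includes a classification label;
  • calculating the semantic similarity between each two text classification annotation data, and clustering the text classification annotation data in the text classification annotation data set according to the semantic similarity calculation results to obtain at least one cluster;
  • in each cluster, performing secondary clustering on multiple text classification annotation data with the same classification label to obtain classification sub-clusters corresponding to each cluster;
  • identifying abnormal classification sub-clusters based on the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs.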
  • This application also provides an anomaly detection device for text classification annotation samples, including:
  • the text classification annotation data set acquisition module is configured to obtain a text classification annotation data set to be denoised, wherein each text classification annotation data includes a classification label;
  • the cluster determination module is configured to calculate the semantic similarity between each two text classification annotation data, and to cluster the text classification annotation data in the text classification annotation data set according to the semantic similarity calculation results to obtain at least one cluster;
  • the classification sub-cluster determination module is configured to perform secondary clustering on multiple text classification annotation data with the same classification label in each cluster to obtain the classification sub-clusters corresponding to each cluster;
  • the abnormal classification sub-cluster identification module is configured to identify abnormal classification sub-clusters based on the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
  • This application also provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the above anomaly detection method for text classification annotation samples is implemented.
  • This application also provides a computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the above anomaly detection method for text classification annotation samples is implemented.
  • Figure 1 is a flow chart of an anomaly detection method for text classification annotation samples provided in Embodiment 1 of the present application;
  • Figure 2 is a flow chart of an anomaly detection method for text classification annotation samples provided in Embodiment 2 of the present application;
  • Figure 3 is a schematic structural diagram of an anomaly detection device for text classification and labeling samples provided in Embodiment 3 of the present application;
  • Figure 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application.
  • Figure 1 is a flow chart of an anomaly detection method for text classification annotation samples provided in Embodiment 1 of the present application. This embodiment can be applied to the situation of performing sample denoising on text classification annotation samples.
  • The method of this embodiment can be executed by an anomaly detection device for text classification annotation samples.
  • The device can be implemented in software and/or hardware, and can be configured in a server or terminal device.
  • The method includes:
  • Obtain a text classification annotation data set to be denoised, where each text classification annotation data includes a classification label.
  • A text classification annotation data set is a data set that includes multiple text classification annotation data. The text classification annotation data can be divided into different categories by their classification labels.
  • S120 Calculate the semantic similarity between each two text classification annotation data, and perform clustering processing on the text classification annotation data in the text classification annotation data set according to the semantic similarity calculation result to obtain at least one cluster.
  • Semantic similarity can represent the similarity between two text classification annotation data.
  • For example, the cosine distance between two text classification annotation data can be calculated based on the trained semantic similarity model, and the semantic similarity between the two can then be obtained from it.
  • Clustering processing can use morphological operators to cluster and merge adjacent similar classification areas, that is, to cluster similar text classification annotation data.
  • A cluster can be a set of samples generated by clustering. Samples in the same cluster are similar to each other and different from samples in other clusters. After clustering similar text classification annotation data, one or more clusters can be obtained.
  • Calculating the semantic similarity between each two text classification annotation data includes: inputting each two text classification annotation data into a pre-trained semantic similarity model to obtain the semantic similarity between them.
  • the semantic similarity model can be a model that calculates the semantic similarity between two input text classification annotation data.
  • For example, the text classification annotation data in the text classification annotation data set can be clustered; assume that 2 clusters are obtained.
  • The 20 text classification annotation data with classification label A and the 5 text classification annotation data with classification label D can be clustered together to obtain cluster 1.
  • The 50 text classification annotation data with classification label B and the 25 text classification annotation data with classification label C can be clustered together to obtain cluster 2.
  • The advantage of this setting is that inputting each two text classification annotation data into the pre-trained semantic similarity model makes it more convenient to calculate the semantic similarity between the two, and the semantic similarity calculated by the model is more reasonable and accurate.
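The pairwise similarity computation described above can be sketched as follows. This is a minimal illustration, not the patent's model: it assumes each text has already been mapped to a sentence vector (the `vectors` array is hypothetical data), and it uses plain cosine similarity in place of the trained semantic similarity model. The function name `cosine_similarity_matrix` is also an assumption.

```python
import numpy as np

def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Return the N x N matrix of cosine similarities between row vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Clip the norms so that an all-zero vector does not cause division by zero.
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T

# Hypothetical sentence vectors for three annotated texts.
vectors = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(vectors)
```

Here the first two vectors point in the same direction, so their similarity is 1, while the third is orthogonal to both and has similarity 0 with them.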
  • In each cluster, perform secondary clustering on multiple text classification annotation data with the same classification label to obtain classification sub-clusters corresponding to each cluster.
  • A classification sub-cluster can be a sub-cluster that exists within a cluster, and a cluster can contain one or more classification sub-clusters.
  • Following the above example, in the obtained cluster 1 and cluster 2, secondary clustering is performed on multiple text classification annotation data with the same classification label.
  • Two classification sub-clusters can be obtained in cluster 1: the 20 text classification annotation data with classification label A form classification sub-cluster 1, and the 5 text classification annotation data with classification label D form classification sub-cluster 2.
  • Two classification sub-clusters can be obtained in cluster 2: the 50 text classification annotation data with classification label B form classification sub-cluster 3, and the 25 text classification annotation data with classification label C form classification sub-cluster 4.
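A small sketch of the secondary clustering step, under the assumption (consistent with the worked example, but not stated as the only possibility in the text) that it amounts to grouping the members of each cluster by classification label. The function name `split_by_label` and the index layout of the sample data are hypothetical.

```python
from collections import defaultdict

def split_by_label(clusters, labels):
    """Group the members of each cluster by their classification label.

    clusters: dict mapping cluster id -> list of sample indices
    labels:   list of classification labels, indexed by sample index
    Returns a dict mapping cluster id -> {label: [sample indices]}.
    """
    sub_clusters = {}
    for cluster_id, members in clusters.items():
        groups = defaultdict(list)
        for idx in members:
            groups[labels[idx]].append(idx)
        sub_clusters[cluster_id] = dict(groups)
    return sub_clusters

# Data laid out to match the example: 20 x A and 5 x D fall in cluster 1,
# 50 x B and 25 x C fall in cluster 2.
labels = ["A"] * 20 + ["D"] * 5 + ["B"] * 50 + ["C"] * 25
clusters = {1: list(range(0, 25)), 2: list(range(25, 100))}
sub_clusters = split_by_label(clusters, labels)
```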
  • Abnormal classification sub-clusters can be classification sub-clusters whose proportion weight value within their cluster does not exceed the proportion weight filtering threshold.
  • Identifying abnormal classification sub-clusters based on the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs includes: counting the number of text classification annotation data in the current classification sub-cluster, and calculating the proportion weight value of that number relative to the number of text classification annotation data in the cluster to which the current classification sub-cluster belongs; determining whether the proportion weight value is greater than the preset proportion weight filtering threshold; and, if the proportion weight value is not greater than the preset proportion weight filtering threshold, identifying the current classification sub-cluster as an abnormal classification sub-cluster.
  • The proportion weight value can be the weight of the current classification sub-cluster in the cluster to which it belongs.
  • The proportion weight filtering threshold may be a preset filtering threshold for the proportion weight value. If the proportion weight value of the current classification sub-cluster is less than or equal to the proportion weight filtering threshold, the current classification sub-cluster is an abnormal classification sub-cluster. If it is greater than the threshold, the current classification sub-cluster is a normal classification sub-cluster.
  • Following the above example, assume that the proportion weight filtering threshold is 30%.
  • The proportion weight value of classification sub-cluster 1 is 20 / (20 + 5) = 80%. Since 80% is greater than 30%, classification sub-cluster 1 is a normal classification sub-cluster.
  • The proportion weight value of classification sub-cluster 2 is 5 / (20 + 5) = 20%. Since 20% is less than 30%, classification sub-cluster 2 is an abnormal classification sub-cluster.
  • The proportion weight value of classification sub-cluster 3 is 50 / (50 + 25) = 66.67%. Since 66.67% is greater than 30%, classification sub-cluster 3 is a normal classification sub-cluster.
  • The proportion weight value of classification sub-cluster 4 is 25 / (50 + 25) = 33.33%. Since 33.33% is greater than 30%, classification sub-cluster 4 is a normal classification sub-cluster.
  • The advantage of this setting is that, by calculating the proportion weight value of the number of text classification annotation data in the current classification sub-cluster relative to the number in the cluster to which it belongs, and comparing it with the preset proportion weight filtering threshold, it can be determined whether the current classification sub-cluster is a normal or an abnormal classification sub-cluster. This improves the validity and reliability of abnormal classification sub-cluster identification.
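The proportion weight check above can be sketched in a few lines. The function name `find_abnormal_sub_clusters` and the dictionary layout are assumptions made for illustration; the sizes and the 30% threshold come from the worked example.

```python
def find_abnormal_sub_clusters(sub_cluster_sizes, threshold=0.30):
    """Flag sub-clusters whose share of their parent cluster is not above the threshold.

    sub_cluster_sizes: dict mapping cluster id -> {label: number of samples}
    Returns a list of (cluster id, label, proportion weight) tuples.
    """
    abnormal = []
    for cluster_id, sizes in sub_cluster_sizes.items():
        total = sum(sizes.values())
        for label, count in sizes.items():
            weight = count / total
            # "Not greater than the threshold" means abnormal.
            if weight <= threshold:
                abnormal.append((cluster_id, label, weight))
    return abnormal

# Sizes from the example: cluster 1 = {A: 20, D: 5}, cluster 2 = {B: 50, C: 25}.
sizes = {1: {"A": 20, "D": 5}, 2: {"B": 50, "C": 25}}
abnormal = find_abnormal_sub_clusters(sizes)
```

With these inputs only the label D sub-cluster in cluster 1 is flagged, matching the example: label C in cluster 2 has a weight of about 33.33%, which is above the 30% threshold.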
  • The technical solution provided by this embodiment of the present application obtains the text classification annotation data set to be denoised; calculates the semantic similarity between each two text classification annotation data, and clusters the text classification annotation data in the text classification annotation data set to obtain at least one cluster; performs, in each cluster, secondary clustering on multiple text classification annotation data with the same classification label to obtain classification sub-clusters corresponding to each cluster; and identifies abnormal classification sub-clusters according to the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
  • This embodiment addresses the heavy workload imposed on staff by the low recognition accuracy of sample denoising models and the lack of explanations for sample denoising. It achieves effective anomaly detection for text classification annotation samples, improves the accuracy of sample denoising, and reduces the labor cost of sample denoising.
  • Before inputting each two text classification annotation data into the pre-trained semantic similarity model, the method also includes: inputting the two obtained sample classification annotation data to the parameter sharing layer respectively to obtain the multiple word vectors corresponding to each of the two sample classification annotation data; inputting the multiple word vectors corresponding to the first sample classification annotation data to the pooling layer to obtain the first sample classification annotation data vector, and inputting the multiple word vectors corresponding to the second sample classification annotation data to the pooling layer to obtain the second sample classification annotation data vector; calculating the absolute value of the difference between the first sample classification annotation data vector and the second sample classification annotation data vector to obtain the sample classification annotation data difference vector; splicing the first sample classification annotation data vector, the second sample classification annotation data vector and the sample classification annotation data difference vector to obtain the sample classification annotation data splicing vector; and inputting the sample classification annotation data splicing vector into the semantic classifier for training, where the semantic similarity model is obtained after training is completed.
  • the sample classification annotation data may be sample data obtained in the sample classification annotation data set.
  • the parameter sharing layer can be a sharing layer that can process the received sample classification annotation data.
  • For example, the parameter sharing layer can be the Bert layer in the sentence-Bert semantic similarity model, which can represent the received sample classification annotation data as word vectors.
  • the word vector may be a vector obtained by vectorizing each word in the sample classification annotation data.
  • the first sample classification annotation data may be one sample data among the two sample classification annotation data.
  • the pooling layer can downsample a large matrix into a small matrix by partitioning the data, reducing the amount of calculation and preventing overfitting.
  • inputting multiple word vectors corresponding to the sample classification annotation data into the pooling layer means performing average processing on all word vectors corresponding to the sample classification annotation data.
  • the first sample classification annotation data vector may be a vector obtained by averaging all word vectors corresponding to the first sample classification annotation data.
  • the second sample classification annotation data may be the other sample data among the two sample classification annotation data.
  • the second sample classification annotation data vector may be a vector obtained by averaging all word vectors corresponding to the second sample classification annotation data.
  • the sample classification annotation data difference vector may be a difference vector obtained by calculating the difference between two sample classification annotation data vectors and performing absolute value processing on the obtained difference.
  • the sample classification annotation data splicing vector can be a vector obtained by splicing two or more vectors.
  • the semantic classifier may be a processing layer capable of performing semantic classification on the input sample classification annotation data splicing vector.
  • For example, two sample classification annotation data are obtained: the first sample classification annotation data and the second sample classification annotation data.
  • The first sample classification annotation data and the second sample classification annotation data are respectively input to the parameter sharing layer.
  • The word vectors obtained from the first sample classification annotation data are $\{m_1, m_2, m_3, \ldots, m_p\}$;
  • the word vectors obtained from the second sample classification annotation data are $\{n_1, n_2, n_3, \ldots, n_q\}$.
  • After the pooling layer produces the first and second sample classification annotation data vectors, the sample classification annotation data difference vector can be obtained as the element-wise absolute value of their difference.
  • In this example, the sample classification annotation data splicing vector w is a nine-dimensional vector.
  • The sample classification annotation data splicing vector w is input into the semantic classifier for training. After the training is completed, the semantic similarity model is obtained.
  • The advantage of this setting is that the semantic similarity model is trained on sample classification annotation data, so that the trained model can more accurately output the semantic similarity between two text classification annotation data, which in turn makes the clustering of multiple text classification annotation data more accurate.
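The pooling, difference, and splicing steps can be sketched as below. This is an illustration under stated assumptions: the word vectors are made-up three-dimensional values standing in for the output of the parameter sharing layer (which makes the spliced vector nine-dimensional, in line with the example), the names `u`, `v`, `mean_pool`, and `splice_features` are hypothetical, and training the semantic classifier on `w` is left out.

```python
import numpy as np

def mean_pool(word_vectors: np.ndarray) -> np.ndarray:
    """Average all word vectors of one text into a single sentence vector."""
    return word_vectors.mean(axis=0)

def splice_features(words_a: np.ndarray, words_b: np.ndarray) -> np.ndarray:
    """Build the spliced vector (u, v, |u - v|) for a pair of texts."""
    u = mean_pool(words_a)   # first sample vector
    v = mean_pool(words_b)   # second sample vector
    diff = np.abs(u - v)     # element-wise absolute difference
    return np.concatenate([u, v, diff])

# Hypothetical word vectors: p = 2 words for the first text, q = 3 for the second.
words_a = np.array([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
words_b = np.array([[0.0, 1.0, 1.0], [2.0, 1.0, 3.0], [4.0, 1.0, 2.0]])
w = splice_features(words_a, words_b)
```

Because pooling averages over words, the two texts may have different lengths p and q while still producing vectors of the same dimension.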
  • After identifying the abnormal classification sub-clusters, the method further includes: adding explanation labels to each text classification annotation data in the identified abnormal classification sub-clusters, and feeding back the text classification annotation data with the added explanation labels to the user.
  • The explanation label may be a label that explains the proportion weight value of the abnormal classification sub-cluster.
  • Following the above example, assume that the proportion weight filtering threshold is 30%.
  • The proportion weight value of classification sub-cluster 1 is 80%. Since 80% is greater than 30%, classification sub-cluster 1 is a normal classification sub-cluster.
  • The proportion weight value of classification sub-cluster 2 is 20%. Since 20% is less than 30%, classification sub-cluster 2 is an abnormal classification sub-cluster.
  • Classification sub-cluster 1 corresponds to label A, and classification sub-cluster 2 corresponds to label D. Since classification sub-cluster 1 is a normal classification sub-cluster, there is no need to add explanation labels or provide feedback to the user. Since classification sub-cluster 2 is an abnormal classification sub-cluster, an explanation label can be added as follows: in cluster 1, the proportion weight value of label D is 20%, which is lower than the threshold.
  • The proportion weight value of classification sub-cluster 3 is 66.67%. Since 66.67% is greater than 30%, classification sub-cluster 3 is a normal classification sub-cluster.
  • The proportion weight value of classification sub-cluster 4 is 33.33%. Since 33.33% is greater than 30%, classification sub-cluster 4 is a normal classification sub-cluster.
  • Classification sub-cluster 3 corresponds to label B, and classification sub-cluster 4 corresponds to label C. Since both are normal classification sub-clusters, there is no need to add explanation labels or provide feedback to the user.
  • the advantage of this setting is that explanation labels are added to each text classification annotation data in the identified abnormal classification sub-clusters and fed back to the user. It can make the work of the staff more convenient, reduce the workload of the staff, improve the efficiency of the staff, and increase the readability of the labels corresponding to the abnormal classification sub-clusters.
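One possible way to attach explanation labels is sketched below. The record layout (`text`, `label`, `cluster` keys), the function names, and the exact wording of the explanation string are all assumptions; the text only specifies that the label should explain the proportion weight value of the abnormal sub-cluster.

```python
def explanation_label(cluster_id, label, weight, threshold=0.30):
    """Format a human-readable explanation for one abnormal sub-cluster."""
    return (f"In cluster {cluster_id}, the proportion weight value of label {label} "
            f"is {weight:.0%}, which is not greater than the threshold of {threshold:.0%}.")

def attach_explanations(samples, abnormal):
    """Return only the samples in abnormal sub-clusters, each with an explanation.

    samples:  list of dicts with (hypothetical) keys "text", "label", "cluster"
    abnormal: dict mapping (cluster id, label) -> proportion weight
    """
    flagged = []
    for sample in samples:
        key = (sample["cluster"], sample["label"])
        if key in abnormal:
            explanation = explanation_label(key[0], key[1], abnormal[key])
            flagged.append({**sample, "explanation": explanation})
    return flagged

samples = [
    {"text": "sample text 1", "label": "A", "cluster": 1},
    {"text": "sample text 2", "label": "D", "cluster": 1},
]
flagged = attach_explanations(samples, {(1, "D"): 0.20})
```

Only the label D sample is returned for user review; the label A sample belongs to a normal sub-cluster and is left untouched.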
  • Figure 2 is a flow chart of an anomaly detection method for text classification annotation samples provided in Embodiment 2 of the present application. This embodiment is explained based on the above embodiment.
  • In this embodiment, the step of clustering the text classification annotation data in the text classification annotation data set according to the semantic similarity calculation results to obtain at least one cluster is explained in detail.
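  • The method includes:
  • Obtain a text classification annotation data set to be denoised, and calculate the semantic similarity between each two text classification annotation data.
  • Construct a semantic similarity matrix based on the semantic similarity calculation results. Each matrix element in the semantic similarity matrix is the semantic similarity between two text classification annotation data.
  • The semantic similarity matrix may be a similarity matrix obtained by filling in the semantic similarity between each two text classification annotation data.
  • S240: In the text classification annotation data set, obtain one unprocessed text classification annotation data as the target data, and mark the status of the target data as processed.
  • The target data may be an unprocessed text classification annotation data selected from the text classification annotation data set to serve as the data currently being processed.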
  • S250 Using the target data as a starting point, query the semantic similarity matrix, and sequentially traverse all density-connected data of the target data in the text classification annotation data set.
  • The density-connected data are the data that are density-connected to the target data. They reflect how closely the target data is related to other data, and thereby determine whether clustering can be performed.
  • S260 Combine the target data and all the density-connected data to form a cluster, and mark the status of each density-connected data as a processed status.
  • S270 Determine whether there is unprocessed text classification annotation data among all text classification annotation data. If there is, return to S240. If there is not, execute S280.
  • S280 In each cluster, perform secondary clustering on multiple text classification annotation data with the same classification label to obtain classification sub-clusters corresponding to each cluster.
  • For example, the text classification annotation data set to be denoised is obtained.
  • Assume that the text classification annotation data set contains 100 text classification annotation data. The semantic similarity between each two text classification annotation data is calculated, from which the semantic similarity matrix is obtained.
  • Assume that 90 of the 100 text classification annotation data are unprocessed. One of the 90 unprocessed text classification annotation data is selected as target data 1, and the status of target data 1 is marked as processed. Taking target data 1 as the starting point, the semantic similarity matrix is queried, and all the density-connected data of target data 1 in the text classification annotation data set are traversed one by one. Assume that there are 20 density-connected data.
  • Target data 1 and the 20 density-connected data together form cluster 1, and the status of each of the 20 density-connected data is marked as processed.
  • In each of the three clusters obtained in this way, multiple text classification annotation data with the same classification label are subjected to secondary clustering to obtain classification sub-clusters corresponding to each cluster.
  • Abnormal classification sub-clusters are identified based on the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
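The traversal in S240 to S270 can be sketched as a breadth-first expansion over the similarity matrix. This is a simplified stand-in, not the patent's exact definition of density-connected data: it assumes two items are directly connected when their similarity is at least a hypothetical `sim_threshold`, and it omits any DBSCAN-style minimum-neighbor requirement. The function name and parameters are assumptions.

```python
from collections import deque
import numpy as np

def density_cluster(sim: np.ndarray, sim_threshold: float = 0.8):
    """Cluster items by expanding outward from unprocessed target items.

    Returns (clusters, isolated): clusters is a list of index lists, and
    isolated lists the items that are connected to no other item.
    """
    n = sim.shape[0]
    processed = [False] * n
    clusters, isolated = [], []
    for target in range(n):
        if processed[target]:
            continue
        processed[target] = True          # mark the target data as processed
        members, queue = [target], deque([target])
        while queue:                      # traverse everything reachable from the target
            current = queue.popleft()
            for j in range(n):
                if j != current and not processed[j] and sim[current, j] >= sim_threshold:
                    processed[j] = True   # mark connected data as processed
                    members.append(j)
                    queue.append(j)
        if len(members) > 1:
            clusters.append(members)
        else:
            isolated.append(target)
    return clusters, isolated

# Hypothetical similarity matrix for six items.
sim = np.full((6, 6), 0.1)
np.fill_diagonal(sim, 1.0)
for a, b, s in [(0, 1, 0.9), (1, 2, 0.85), (3, 4, 0.95)]:
    sim[a, b] = sim[b, a] = s
clusters, isolated = density_cluster(sim)
```

Items 0 and 2 are not directly similar, but both are linked through item 1, so all three end up in one cluster; item 5 has no close neighbor and is reported as isolated.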
  • Isolated text classification annotation data can be data that does not belong to any cluster after clustering processing.
  • Abnormal annotation data may be text classification annotation data in an abnormal state.
  • For example, assume that the text classification annotation data set to be denoised contains 100 text classification annotation data. After clustering, 3 clusters and 5 isolated text classification annotation data can be obtained, and the 5 isolated text classification annotation data are identified as abnormal annotation data.
  • As another example, assume that the text classification annotation data set contains 100 text classification annotation data, and that a cluster must contain at least 10 text classification annotation data to be considered a valid cluster.
  • Assume that 4 clusters are obtained: cluster 1 contains 20 text classification annotation data, cluster 2 contains 5, cluster 3 contains 50, and cluster 4 contains 25.
  • Although the 5 text classification annotation data in cluster 2 can be clustered together, cluster 2 does not meet the threshold condition for a valid cluster, so its 5 text classification annotation data are identified as abnormal annotation data.
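The minimum-size rule for valid clusters can be illustrated with a short sketch. The function name `filter_small_clusters` and the dictionary layout are assumptions; the cluster sizes and the minimum of 10 come from the example.

```python
def filter_small_clusters(cluster_sizes, min_size=10):
    """Split clusters into valid ones and ones too small to count as valid.

    cluster_sizes: dict mapping cluster id -> number of samples
    Returns (valid, too_small) as two dicts with the same layout.
    """
    valid = {cid: n for cid, n in cluster_sizes.items() if n >= min_size}
    too_small = {cid: n for cid, n in cluster_sizes.items() if n < min_size}
    return valid, too_small

# Cluster sizes from the example.
valid, too_small = filter_small_clusters({1: 20, 2: 5, 3: 50, 4: 25})
```

Cluster 2 falls below the minimum size, so all of its samples would be treated as abnormal annotation data.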
  • The technical solution provided by this embodiment of the present application is to: obtain the text classification annotation data set to be denoised; calculate the semantic similarity between each two text classification annotation data; construct the semantic similarity matrix based on the semantic similarity calculation results; in the text classification annotation data set, obtain one unprocessed text classification annotation data as the target data, and mark the status of the target data as processed; taking the target data as the starting point, query the semantic similarity matrix and sequentially traverse all density-connected data of the target data in the text classification annotation data set; combine the target data and all the density-connected data to form a cluster, and mark the status of each density-connected data as processed; return to the operation of obtaining one unprocessed text classification annotation data as the target data until all text classification annotation data have been processed; in each cluster, perform secondary clustering on multiple text classification annotation data with the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identify abnormal classification sub-clusters according to the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
  • Figure 3 is a schematic structural diagram of an anomaly detection device for text classification and annotation samples provided in the third embodiment of the present application.
  • The anomaly detection device for text classification annotation samples provided in this embodiment can be implemented in software and/or hardware, and can be configured in terminal equipment or servers.
  • the anomaly detection device for text classification annotation samples is configured to implement any anomaly detection method for text classification annotation samples in the embodiments of the present application.
  • the device may include: a text classification annotation data set acquisition module 310, a cluster determination module 320, a classification sub-cluster determination module 330 and an abnormal classification sub-cluster identification module 340.
  • The text classification annotation data set acquisition module 310 is configured to obtain a text classification annotation data set to be denoised, wherein each text classification annotation data includes a classification label; the cluster determination module 320 is configured to calculate the semantic similarity between each two text classification annotation data and, based on the semantic similarity calculation results, cluster the text classification annotation data in the text classification annotation data set to obtain at least one cluster; the classification sub-cluster determination module 330 is configured to perform secondary clustering on multiple text classification annotation data with the same classification label in each cluster to obtain classification sub-clusters corresponding to each cluster; the abnormal classification sub-cluster identification module 340 is configured to identify abnormal classification sub-clusters according to the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
  • The technical solution provided by this embodiment of the present application obtains the text classification annotation data set to be denoised; calculates the semantic similarity between each two text classification annotation data, and clusters the text classification annotation data in the text classification annotation data set to obtain at least one cluster; performs, in each cluster, secondary clustering on multiple text classification annotation data with the same classification label to obtain classification sub-clusters corresponding to each cluster; and identifies abnormal classification sub-clusters according to the proportion of the text classification annotation data in the classification sub-cluster to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
  • This embodiment addresses the heavy workload imposed on staff by the low recognition accuracy of sample denoising models and the lack of explanations for sample denoising. It enables effective anomaly detection for text classification annotation samples, improves the accuracy of sample denoising, and reduces the labor cost of sample denoising.
  • the cluster determination module 320 can be configured to: input every two pieces of text classification annotation data into the pre-trained semantic similarity model, and obtain the semantic similarity between those two pieces of data.
  • a semantic similarity model training module is also included, which can be configured to, before every two pieces of text classification annotation data are input into the pre-trained semantic similarity model: input two obtained pieces of sample classification annotation data separately into the parameter sharing layer to obtain the multiple character vectors corresponding to each of them; input the character vectors of the first sample classification annotation data into the pooling layer to obtain the first sample classification annotation data vector, and input the character vectors of the second sample classification annotation data into the pooling layer to obtain the second sample classification annotation data vector; calculate the absolute value of the difference between the first and second sample classification annotation data vectors to obtain the sample classification annotation data difference vector; concatenate the first sample classification annotation data vector, the second sample classification annotation data vector and the difference vector to obtain the sample classification annotation data concatenated vector; and input the concatenated vector into the semantic classifier for training, the semantic similarity model being obtained once training is complete.
  • the cluster determination module 320 may be configured to: construct a semantic similarity matrix from the semantic similarity calculation results, wherein each matrix element is the semantic similarity between two pieces of text classification annotation data; obtain one unprocessed piece of text classification annotation data from the data set as the target data, and mark the target data as processed; starting from the target data, query the semantic similarity matrix and successively traverse all data in the data set that are density-connected to the target data; form a cluster from the target data together with all of its density-connected data, and mark each density-connected piece of data as processed; and return to the operation of obtaining one unprocessed piece of text classification annotation data as the target data, until all text classification annotation data have been processed.
  • an abnormal annotation data determination module is also included, which can be configured to identify isolated text classification annotation data that does not belong to any cluster as abnormal annotation data.
  • the abnormal classification sub-cluster identification module 340 may be configured to: count the number of pieces of text classification annotation data in the current classification sub-cluster, and calculate the proportion weight value of that number relative to the number of pieces of text classification annotation data in the cluster to which the current classification sub-cluster belongs; determine whether the proportion weight value is greater than the preset proportion weight filtering threshold and, in response to the proportion weight value being not greater than the preset threshold, identify the current classification sub-cluster as an abnormal classification sub-cluster.
  • an explanation label adding module is also included, which can be configured to: after the abnormal classification sub-clusters have been identified according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs, add an explanation label to each piece of text classification annotation data in an abnormal classification sub-cluster, and feed the labeled data back to the user.
  • the above-mentioned anomaly detection device for text classification and annotation samples can execute the anomaly detection method for text classification and annotation samples provided by any embodiment of the present application, and has functional modules and effects corresponding to the execution method.
  • Figure 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application.
  • the device includes a processor 410, a memory 420, an input device 430 and an output device 440; the device may have one or more processors 410, and Figure 4 takes one processor 410 as an example; the processor 410, the memory 420, the input device 430 and the output device 440 in the device may be connected by a bus or by other means, and Figure 4 takes connection by a bus as an example.
  • as a computer-readable storage medium, the memory 420 can be configured to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the anomaly detection method for text classification annotation samples in the embodiments of the present application (for example, the text classification annotation data set acquisition module 310, the cluster determination module 320, the classification sub-cluster determination module 330 and the abnormal classification sub-cluster identification module 340).
  • the processor 410 executes the various functional applications and data processing of the device by running the software programs, instructions and modules stored in the memory 420, that is, it implements the above-mentioned anomaly detection method for text classification annotation samples, which includes:
  • obtaining a text classification annotation data set to be denoised; calculating the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, clustering the text classification annotation data in the data set to obtain at least one cluster; within each cluster, performing secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
  • the memory 420 may mainly include a stored program area and a stored data area, where the stored program area may store an operating system and an application program required for at least one function; the stored data area may store data created according to the use of the terminal, etc.
  • the memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • memory 420 may include memory located remotely from processor 410, and these remote memories may be connected to the device through a network. Examples of the above-mentioned networks include but are not limited to the Internet, intranets, local area networks, mobile communication networks and combinations thereof.
  • Input device 430 may be configured to receive input numeric or character information and to generate key signal inputs related to user settings and functional controls of the device.
  • the output device 440 may include a display device such as a display screen.
  • Embodiment 5 of the present application also provides a computer-readable storage medium containing computer-readable instructions.
  • the computer-readable instructions, when executed by a computer processor, are used to perform an anomaly detection method for text classification annotation samples.
  • the method includes: obtaining a text classification annotation data set to be denoised; calculating the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, clustering the text classification annotation data in the data set to obtain at least one cluster; within each cluster, performing secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
  • in the computer-readable storage medium provided by an embodiment of the present application, the computer-readable instructions are not limited to the method operations described above and can also perform the related operations of the anomaly detection method for text classification annotation samples provided by any embodiment of the present application.
  • the present application can be implemented by software together with the necessary general-purpose hardware, or it can be implemented in hardware.
  • the technical solution of this application can essentially be embodied in the form of a software product.
  • the computer software product can be stored in a computer-readable storage medium, such as a computer's floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk or optical disk, and includes multiple instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of this application.
  • in the above embodiments of the anomaly detection device for text classification annotation samples, the units and modules included are divided only according to functional logic, and the division is not limited to the one described as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

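Disclosed herein are an anomaly detection method, apparatus, device, and medium for text classification annotation samples. The anomaly detection method for text classification annotation samples includes: obtaining a text classification annotation data set to be denoised; calculating the semantic similarity between every two pieces of text classification annotation data, and clustering the text classification annotation data in the data set to obtain at least one cluster; within each cluster, performing secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.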

Description

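Anomaly detection method, apparatus, device, and medium for text classification annotation samples
This application claims priority to Chinese Patent Application No. 202210749204.X, filed with the China Patent Office on June 28, 2022, the entire contents of which are incorporated herein by reference.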
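Technical Field
This application relates to computer data processing technology, for example, to an anomaly detection method, apparatus, device, and medium for text classification annotation samples.
Background
Text classification is a common processing task in the field of machine learning, with application scenarios including news classification, sentiment analysis, and intent recognition. In a practical text classification task, developers must first annotate a certain number of samples with the classification labels the scenario requires, and then train a model to build the text classification service. In this process, the quality of the annotated samples is closely tied to the prediction accuracy of the service: a well-annotated sample set yields a better-performing model, while a poorly annotated one leads to poor classification results. Noisy samples that degrade annotation quality arise for several reasons, including inconsistent standards within the annotation team and errors in annotators' subjective judgment, so sample denoising is an important step in developing text classification applications.
In the related art, when noisy samples are removed by judging whether a neural network converges, normal samples among the conflicting samples may be removed while erroneous samples are retained, which lowers data quality. When the accuracy of noisy-sample identification is too low, a large amount of manual work is easily introduced, and normal samples are wrongly filtered out while noisy samples are kept.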
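Summary
This application provides an anomaly detection method, apparatus, device, and medium for text classification annotation samples, so as to detect anomalies in text classification annotation samples effectively and reduce the labor cost of sample denoising.
This application provides an anomaly detection method for text classification annotation samples, including:
obtaining a text classification annotation data set to be denoised, wherein each piece of text classification annotation data includes a classification label;
calculating the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, clustering the text classification annotation data in the data set to obtain at least one cluster;
within each cluster, performing secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster;
identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
This application further provides an anomaly detection apparatus for text classification annotation samples, including:
a text classification annotation data set acquisition module, configured to obtain a text classification annotation data set to be denoised, wherein each piece of text classification annotation data includes a classification label;
a cluster determination module, configured to calculate the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, cluster the text classification annotation data in the data set to obtain at least one cluster;
a classification sub-cluster determination module, configured to perform, within each cluster, secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster;
an abnormal classification sub-cluster identification module, configured to identify abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
This application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above anomaly detection method for text classification annotation samples when executing the computer program.
This application further provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above anomaly detection method for text classification annotation samples.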
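Brief Description of the Drawings
Figure 1 is a flowchart of an anomaly detection method for text classification annotation samples provided in Embodiment 1 of the present application;
Figure 2 is a flowchart of an anomaly detection method for text classification annotation samples provided in Embodiment 2 of the present application;
Figure 3 is a schematic structural diagram of an anomaly detection apparatus for text classification annotation samples provided in Embodiment 3 of the present application;
Figure 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application.
Detailed Description
The present application is described below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the present application. For ease of description, the drawings show only the parts relevant to the present application.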
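Embodiment 1
Figure 1 is a flowchart of an anomaly detection method for text classification annotation samples provided in Embodiment 1 of the present application. This embodiment is applicable to denoising text classification annotation samples. The method of this embodiment can be executed by an anomaly detection apparatus for text classification annotation samples; the apparatus can be implemented in software and/or hardware and can be deployed in a server or a terminal device.
The method includes:
S110: Obtain a text classification annotation data set to be denoised.
Each piece of text classification annotation data includes a classification label.
A text classification annotation data set is a collection of multiple pieces of text classification annotation data, which can be divided into different categories by their classification labels.
For example, suppose the text classification annotation data set contains 100 pieces of text classification annotation data, of which 20 carry classification label A, 50 carry classification label B, 25 carry classification label C, and 5 carry classification label D.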
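S120: Calculate the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, cluster the text classification annotation data in the data set to obtain at least one cluster.
The semantic similarity represents how similar two pieces of text classification annotation data are. A trained semantic similarity model can be used to compute the cosine distance between two pieces of text classification annotation data, from which their semantic similarity is derived. Clustering may apply morphological operators to cluster and merge neighboring regions of similar classification, that is, to cluster similar text classification annotation data together. A cluster is a set of samples produced by clustering: samples within the same cluster are similar to one another and differ from samples in other clusters. After similar text classification annotation data are clustered, one or more clusters are obtained.
Calculating the semantic similarity between every two pieces of text classification annotation data includes: inputting every two pieces of text classification annotation data into a pre-trained semantic similarity model, and obtaining the semantic similarity between those two pieces of data.
The semantic similarity model is a model that, given two pieces of text classification annotation data as input, computes the semantic similarity between them.
Continuing the previous example, suppose the data set contains 100 pieces of text classification annotation data. The semantic similarity between every two of the 100 pieces is calculated. The text classification annotation data in the data set can then be clustered according to the semantic similarity calculation results. Suppose 2 clusters are obtained. The 20 pieces with classification label A and the 5 pieces with classification label D are grouped together to form cluster 1. The 50 pieces with classification label B and the 25 pieces with classification label C are grouped together to form cluster 2.
The advantage of this arrangement is that the semantic similarity is obtained by inputting every two pieces of text classification annotation data into a pre-trained semantic similarity model. This makes computing the semantic similarity between the two more convenient, and computing it with a semantic similarity model is more reasonable and accurate.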
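As a rough illustration of the pairwise similarity step, the sketch below computes cosine similarity between sentence vectors and fills a symmetric matrix. It assumes each piece of annotation data has already been encoded as a vector by some model; the function names `cosine_similarity` and `similarity_matrix` are hypothetical and do not come from the application.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Symmetric matrix whose element [i][j] is the similarity of items i and j."""
    n = len(vectors)
    sim = [[1.0] * n for _ in range(n)]  # an item is fully similar to itself
    for i in range(n):
        for j in range(i + 1, n):
            score = cosine_similarity(vectors[i], vectors[j])
            sim[i][j] = sim[j][i] = score
    return sim
```

Computing all pairs costs on the order of n² similarity evaluations, which is worth keeping in mind for large annotation sets.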
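S130: Within each cluster, perform secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster.
Secondary clustering means clustering again inside an obtained cluster to produce classification sub-clusters whose members are even more similar. A classification sub-cluster is a sub-cluster that exists inside a cluster; one cluster can contain one or more classification sub-clusters.
Continuing the previous example, secondary clustering is performed in cluster 1 and cluster 2 on the multiple pieces of text classification annotation data that share the same classification label. Cluster 1 yields two classification sub-clusters: the 20 pieces with classification label A form classification sub-cluster 1, and the 5 pieces with classification label D form classification sub-cluster 2. Cluster 2 yields two classification sub-clusters: the 50 pieces with classification label B form classification sub-cluster 3, and the 25 pieces with classification label C form classification sub-cluster 4.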
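The application does not spell out the secondary clustering algorithm. In the worked example each sub-cluster coincides with one label inside a cluster, so a minimal sketch, under that assumption, simply groups the members of a cluster by label. The name `split_into_subclusters` and the index-based data layout are hypothetical.

```python
from collections import defaultdict


def split_into_subclusters(cluster, labels):
    """Group the item indices of one cluster by their classification label.

    cluster: list of item indices belonging to the cluster
    labels:  labels[i] is the classification label of item i
    """
    groups = defaultdict(list)
    for idx in cluster:
        groups[labels[idx]].append(idx)
    return dict(groups)
```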
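S140: Identify abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
An abnormal classification sub-cluster is a classification sub-cluster within a cluster that does not satisfy the proportion weight filtering threshold.
Identifying abnormal classification sub-clusters in this way includes: counting the number of pieces of text classification annotation data in the current classification sub-cluster, and calculating the proportion weight value of that number relative to the number of pieces of text classification annotation data in the cluster to which the current classification sub-cluster belongs; determining whether the proportion weight value is greater than a preset proportion weight filtering threshold and, if the proportion weight value is not greater than the preset threshold, identifying the current classification sub-cluster as an abnormal classification sub-cluster.
The proportion weight value is the weight of the current classification sub-cluster within the cluster it belongs to. The proportion weight filtering threshold is a preset filtering threshold for the proportion weight value. If the proportion weight value of the current classification sub-cluster is less than or equal to the threshold, the current classification sub-cluster is an abnormal classification sub-cluster. If it is greater than the threshold, the current classification sub-cluster is a normal classification sub-cluster.
Continuing the previous example, suppose the proportion weight filtering threshold is 30%. In cluster 1, the proportion weight value of classification sub-cluster 1 is 80%; since 80% is greater than 30%, classification sub-cluster 1 is a normal classification sub-cluster. The proportion weight value of classification sub-cluster 2 is 20%; since 20% is less than 30%, classification sub-cluster 2 is an abnormal classification sub-cluster.
In cluster 2, the proportion weight value of classification sub-cluster 3 is 66.67%; since 66.67% is greater than 30%, classification sub-cluster 3 is a normal classification sub-cluster. The proportion weight value of classification sub-cluster 4 is 33.33%; since 33.33% is greater than 30%, classification sub-cluster 4 is a normal classification sub-cluster.
The advantage of this arrangement is that, by calculating the proportion weight value of the annotation data in the current classification sub-cluster relative to the annotation data in its cluster and comparing it with the preset proportion weight filtering threshold, the current classification sub-cluster can be determined to be normal or abnormal. The computed proportion weight value thus supports an effective judgment of the current classification sub-cluster, improving the validity and reliability of abnormal sub-cluster identification.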
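The proportion check in S140 can be sketched as follows, reusing the numbers from the worked example (20 and 5 items in cluster 1, 50 and 25 in cluster 2, threshold 30%). The function name and the dictionary layout are assumptions made for illustration only.

```python
def find_abnormal_subclusters(subclusters, threshold):
    """Return {label: (ratio, is_abnormal)} for the sub-clusters of one cluster.

    A sub-cluster is abnormal when its share of the cluster is not greater
    than the threshold, matching the rule described in the text.
    """
    total = sum(len(members) for members in subclusters.values())
    report = {}
    for label, members in subclusters.items():
        ratio = len(members) / total if total else 0.0
        report[label] = (ratio, ratio <= threshold)
    return report


# Worked example from the text (item indices are placeholders).
cluster_1 = {"A": list(range(20)), "D": list(range(20, 25))}
cluster_2 = {"B": list(range(50)), "C": list(range(50, 75))}
report_1 = find_abnormal_subclusters(cluster_1, threshold=0.30)
report_2 = find_abnormal_subclusters(cluster_2, threshold=0.30)
```

With these inputs only label D in cluster 1 is flagged, since 5/25 = 20% does not exceed 30%, while 25/75 ≈ 33.33% in cluster 2 does.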
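The technical solution provided by the embodiment of the present application obtains the text classification annotation data set to be denoised; calculates the semantic similarity between every two pieces of text classification annotation data and clusters the text classification annotation data in the data set to obtain at least one cluster; performs, within each cluster, secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifies abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs. The embodiment addresses the heavy workload placed on staff by the low recognition accuracy of sample denoising models, as well as the lack of any explanation for why a sample was removed; it achieves effective anomaly detection on text classification annotation samples, improves the accuracy of sample denoising, and reduces the labor cost of sample denoising.
Before every two pieces of text classification annotation data are input into the pre-trained semantic similarity model, the method further includes: inputting two obtained pieces of sample classification annotation data separately into the parameter sharing layer to obtain the multiple character vectors corresponding to each of them; inputting the character vectors of the first sample classification annotation data into the pooling layer to obtain the first sample classification annotation data vector, and inputting the character vectors of the second sample classification annotation data into the pooling layer to obtain the second sample classification annotation data vector; calculating the absolute value of the difference between the first and second sample classification annotation data vectors to obtain the sample classification annotation data difference vector; concatenating the first sample classification annotation data vector, the second sample classification annotation data vector, and the difference vector to obtain the sample classification annotation data concatenated vector; and inputting the concatenated vector into the semantic classifier for training, the semantic similarity model being obtained once training is complete.
Sample classification annotation data is sample data taken from a sample classification annotation data set. The parameter sharing layer is a shared layer that processes the received sample classification annotation data; it can be the Bert layer of the sentence-Bert semantic similarity model, which represents the received sample classification annotation data as character vectors. A character vector is the vector obtained by vectorizing one character of the sample classification annotation data.
The first sample classification annotation data is one of the two pieces of sample classification annotation data. A pooling layer samples the data by region, downsampling a large matrix into a small one, which reduces computation and helps prevent overfitting. In this embodiment, inputting the character vectors of a piece of sample classification annotation data into the pooling layer means averaging all of its character vectors. The first sample classification annotation data vector is the vector obtained by averaging all character vectors of the first sample classification annotation data. The second sample classification annotation data is the other of the two pieces of sample classification annotation data, and the second sample classification annotation data vector is the vector obtained by averaging all of its character vectors. The sample classification annotation data difference vector is obtained by computing the difference between the two sample classification annotation data vectors and taking its absolute value. The sample classification annotation data concatenated vector is the vector obtained by concatenating two or more vectors. The semantic classifier is a processing layer that performs semantic classification on the input concatenated vector.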
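For example, suppose two pieces of sample classification annotation data are obtained, namely the first sample classification annotation data and the second sample classification annotation data. Both are input separately into the parameter sharing layer. The character vectors obtained for the first are {m_1, m_2, m_3, …, m_p}; the character vectors obtained for the second are {n_1, n_2, n_3, …, n_q}. Inputting the character vectors of the first sample classification annotation data into the pooling layer, which averages them, gives the first sample classification annotation data vector
$$u = \frac{1}{p}\sum_{i=1}^{p} m_i$$ (formula image PCTCN2022118488-appb-000001)
Inputting the character vectors of the second sample classification annotation data into the pooling layer gives the second sample classification annotation data vector
$$v = \frac{1}{q}\sum_{j=1}^{q} n_j$$ (formula image PCTCN2022118488-appb-000002)
The sample classification annotation data difference vector is therefore |u-v|.
Concatenating the vectors u, v, and |u-v| gives the sample classification annotation data concatenated vector w = {u, v, |u-v|}. If u and v are three-dimensional vectors, the concatenated vector w is nine-dimensional. Accordingly, the concatenated vector w is input into the semantic classifier for training, and the semantic similarity model is obtained once training is complete.
The advantage of this arrangement is that the semantic similarity model is trained on sample classification annotation data, so the trained model can output the semantic similarity between two pieces of text classification annotation data more accurately, which in turn allows the text classification annotation data to be clustered more accurately.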
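The pooling and concatenation steps can be illustrated with plain lists. This is only a sketch of how the classifier input w = {u, v, |u-v|} is assembled; it assumes the character vectors have already been produced by the shared encoder, and the names `mean_pool` and `build_pair_features` are hypothetical.

```python
def mean_pool(char_vectors):
    """Average a non-empty list of equal-length character vectors."""
    count = len(char_vectors)
    dim = len(char_vectors[0])
    return [sum(vec[d] for vec in char_vectors) / count for d in range(dim)]


def build_pair_features(chars_a, chars_b):
    """Concatenate u, v and |u - v| into one classifier input vector."""
    u = mean_pool(chars_a)
    v = mean_pool(chars_b)
    diff = [abs(x - y) for x, y in zip(u, v)]
    return u + v + diff


# Two toy samples with 3-dimensional character vectors (made-up values).
w = build_pair_features(
    [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],  # first sample: p = 2 characters
    [[0.0, 0.0, 0.0]],                   # second sample: q = 1 character
)
```

Here u is [2.0, 3.0, 4.0] and v is [0.0, 0.0, 0.0], so w has nine components, in line with the three-dimensional case described in the text.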
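After the abnormal classification sub-clusters are identified according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs, the method further includes: adding an explanation label to each piece of text classification annotation data in an abnormal classification sub-cluster, and feeding the labeled text classification annotation data back to the user.
An explanation label is a label that explains the proportion weight value of an abnormal classification sub-cluster.
Continuing the previous example, suppose the proportion weight filtering threshold is 30%. In cluster 1, the proportion weight value of classification sub-cluster 1 is 80%; since 80% is greater than 30%, classification sub-cluster 1 is a normal classification sub-cluster. The proportion weight value of classification sub-cluster 2 is 20%; since 20% is less than 30%, classification sub-cluster 2 is an abnormal classification sub-cluster. Classification sub-cluster 1 corresponds to label A and classification sub-cluster 2 to label D. Because classification sub-cluster 1 is normal, no explanation label needs to be added or fed back to the user. Because classification sub-cluster 2 is abnormal, an explanation label can be added such as: in cluster 1, the proportion weight value of label D is 20%, which is below the threshold.
In cluster 2, the proportion weight value of classification sub-cluster 3 is 66.67%; since 66.67% is greater than 30%, classification sub-cluster 3 is a normal classification sub-cluster. The proportion weight value of classification sub-cluster 4 is 33.33%; since 33.33% is greater than 30%, classification sub-cluster 4 is a normal classification sub-cluster. Classification sub-cluster 3 corresponds to label B and classification sub-cluster 4 to label C. Because both are normal, no explanation label needs to be added or fed back to the user.
The advantage of this arrangement is that an explanation label is added to each piece of text classification annotation data in the identified abnormal classification sub-clusters and fed back to the user. This makes the staff's work easier, reduces their workload, improves their efficiency, and makes the labels associated with abnormal classification sub-clusters more readable.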
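One possible way to attach such an explanation to every record of an abnormal sub-cluster is sketched below. The record layout (dictionaries with `text` and `label` keys), the `explanation` field, and the wording of the message are all assumptions; the application only states that a label explaining the proportion weight value is added.

```python
def add_explanation_labels(records, cluster_id, label, ratio, threshold):
    """Return copies of the records with an explanation string attached."""
    explanation = (
        f"In cluster {cluster_id}, label {label} has a proportion weight of "
        f"{ratio:.0%}, which is not greater than the threshold of {threshold:.0%}."
    )
    # Copy each record so the caller's original data is left unchanged.
    return [{**record, "explanation": explanation} for record in records]


flagged = add_explanation_labels(
    [{"text": "sample text", "label": "D"}],  # placeholder record
    cluster_id=1, label="D", ratio=0.20, threshold=0.30,
)
```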
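Embodiment 2
Figure 2 is a flowchart of an anomaly detection method for text classification annotation samples provided in Embodiment 2 of the present application. This embodiment builds on the above embodiment and explains how the text classification annotation data in the data set are clustered, according to the semantic similarity calculation results, to obtain at least one cluster.
Accordingly, the method includes:
S210: Obtain a text classification annotation data set to be denoised.
S220: Calculate the semantic similarity between every two pieces of text classification annotation data.
S230: Construct a semantic similarity matrix from the semantic similarity calculation results.
Each matrix element of the semantic similarity matrix is the semantic similarity between two pieces of text classification annotation data.
The semantic similarity matrix is the similarity matrix obtained by filling it with the semantic similarity between every two pieces of text classification annotation data.
S240: In the text classification annotation data set, obtain one unprocessed piece of text classification annotation data as the target data, and mark the target data as processed.
The target data is one unprocessed piece of text classification annotation data selected from the data set to be processed next.
S250: Starting from the target data, query the semantic similarity matrix and successively traverse all data in the data set that are density-connected to the target data.
The full set of density-connected data is the density-connected data associated with the target data; it reflects how closely the target data is tied to other data and thus whether they can be clustered together.
S260: Form a cluster from the target data together with all of its density-connected data, and mark each density-connected piece of data as processed.
S270: Determine whether any unprocessed text classification annotation data remain; if so, return to S240; if not, proceed to S280.
S280: Within each cluster, perform secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster.
S290: Identify abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.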
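For example, the text classification annotation data set to be denoised is obtained first. Suppose it contains 100 pieces of text classification annotation data; the semantic similarity between every two pieces is calculated, giving the semantic similarity matrix. Suppose 90 of the 100 pieces are still unprocessed. One of the 90 is selected as target data 1 and marked as processed. Starting from target data 1, the semantic similarity matrix is queried and all data in the data set that are density-connected to target data 1 are traversed successively; suppose there are 20 of them. Target data 1 and the 20 density-connected pieces together form cluster 1, and each of the 20 is marked as processed. At this point 69 unprocessed pieces remain. One of the 69 is then selected as the next target data, and the remaining clusters are obtained in the same way. Suppose 3 clusters are obtained once processing is complete.
In each of the 3 clusters, secondary clustering is performed on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster. Abnormal classification sub-clusters are then identified according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.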
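The traversal in S240 to S270 can be sketched as a breadth-first expansion over the similarity matrix. This is a simplification and not necessarily the applicant's algorithm: it treats two items as connected when their similarity reaches a hypothetical `sim_threshold`, and it omits the minimum-neighbor (core point) condition that DBSCAN-style density clustering would normally add.

```python
from collections import deque


def cluster_by_similarity(sim, sim_threshold):
    """Group items that are transitively connected by high similarity.

    Returns (clusters, isolated): clusters hold two or more item indices,
    isolated lists the items that ended up connected to nothing.
    """
    n = len(sim)
    processed = [False] * n
    clusters, isolated = [], []
    for start in range(n):
        if processed[start]:
            continue
        processed[start] = True          # mark the target data as processed
        members, queue = [start], deque([start])
        while queue:                     # traverse everything reachable
            current = queue.popleft()
            for j in range(n):
                if not processed[j] and sim[current][j] >= sim_threshold:
                    processed[j] = True
                    members.append(j)
                    queue.append(j)
        if len(members) > 1:
            clusters.append(sorted(members))
        else:
            isolated.append(start)
    return clusters, isolated
```

Because every item is marked as processed exactly once, the outer loop ends when no unprocessed data remain, mirroring the check in S270.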
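After the text classification annotation data in the data set are clustered according to the semantic similarity calculation results to obtain at least one cluster, the method further includes: identifying isolated text classification annotation data that does not belong to any cluster as abnormal annotation data.
Isolated text classification annotation data is data that, after clustering, does not belong to any cluster. Abnormal annotation data is text classification annotation data in an abnormal state.
For example, suppose the data set to be denoised contains 100 pieces of text classification annotation data and clustering yields 3 clusters and 5 isolated pieces; the 5 isolated pieces are identified as abnormal annotation data.
In one embodiment, suppose the data set contains 100 pieces of text classification annotation data, and a cluster is considered valid only if it contains at least 10 pieces. Suppose clustering the 100 pieces yields 4 clusters: cluster 1 contains 20 pieces, cluster 2 contains 5, cluster 3 contains 50, and cluster 4 contains 25. Although the 5 pieces in cluster 2 can be clustered together, cluster 2 does not meet the threshold condition for a valid cluster, so its 5 pieces are identified as abnormal annotation data.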
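The minimum-size rule in this embodiment amounts to a simple filter over the clusters. The sketch below reuses the numbers from the text (clusters of 20, 5, 50, and 25 items with a minimum of 10); the function name and the use of integer indices as stand-ins for annotation data are assumptions.

```python
def filter_small_clusters(clusters, min_size):
    """Split clusters into valid ones and the items of undersized ones."""
    valid = [c for c in clusters if len(c) >= min_size]
    abnormal = [item for c in clusters if len(c) < min_size for item in c]
    return valid, abnormal


example_clusters = [
    list(range(0, 20)),    # cluster 1: 20 items
    list(range(20, 25)),   # cluster 2: 5 items, below the minimum
    list(range(25, 75)),   # cluster 3: 50 items
    list(range(75, 100)),  # cluster 4: 25 items
]
valid_clusters, abnormal_items = filter_small_clusters(example_clusters, min_size=10)
```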
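The technical solution provided by the embodiment of the present application obtains the text classification annotation data set to be denoised; calculates the semantic similarity between every two pieces of text classification annotation data and constructs a semantic similarity matrix from the results; obtains one unprocessed piece of text classification annotation data from the data set as the target data and marks it as processed; starting from the target data, queries the semantic similarity matrix and successively traverses all data in the data set that are density-connected to the target data; forms a cluster from the target data together with all of its density-connected data and marks each density-connected piece of data as processed; returns to the operation of obtaining one unprocessed piece of text classification annotation data as the target data until all text classification annotation data have been processed; performs, within each cluster, secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifies abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs. The embodiment thus clusters text classification annotation samples effectively, which in turn improves the accuracy of sample denoising.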
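Embodiment 3
Figure 3 is a schematic structural diagram of an anomaly detection apparatus for text classification annotation samples provided in Embodiment 3 of the present application. The apparatus provided in this embodiment can be implemented in software and/or hardware and can be deployed in a terminal device or a server. The apparatus is configured to implement any of the anomaly detection methods for text classification annotation samples in the embodiments of the present application. As shown in Figure 3, the apparatus may include: a text classification annotation data set acquisition module 310, a cluster determination module 320, a classification sub-cluster determination module 330, and an abnormal classification sub-cluster identification module 340.
The text classification annotation data set acquisition module 310 is configured to obtain a text classification annotation data set to be denoised, wherein each piece of text classification annotation data includes a classification label; the cluster determination module 320 is configured to calculate the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, cluster the text classification annotation data in the data set to obtain at least one cluster; the classification sub-cluster determination module 330 is configured to perform, within each cluster, secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; the abnormal classification sub-cluster identification module 340 is configured to identify abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
The technical solution provided by the embodiment of the present application obtains the text classification annotation data set to be denoised; calculates the semantic similarity between every two pieces of text classification annotation data and clusters the text classification annotation data in the data set to obtain at least one cluster; performs, within each cluster, secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifies abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs. The embodiment addresses the heavy workload placed on staff by the low recognition accuracy of sample denoising models, as well as the lack of any explanation for why a sample was removed; it achieves effective anomaly detection on text classification annotation samples, improves the accuracy of sample denoising, and reduces the labor cost of sample denoising.
On the basis of the above embodiment, the cluster determination module 320 can be configured to: input every two pieces of text classification annotation data into the pre-trained semantic similarity model, and obtain the semantic similarity between those two pieces of data.
On the basis of the above embodiment, the apparatus further includes a semantic similarity model training module, which can be configured to, before every two pieces of text classification annotation data are input into the pre-trained semantic similarity model: input two obtained pieces of sample classification annotation data separately into the parameter sharing layer to obtain the multiple character vectors corresponding to each of them; input the character vectors of the first sample classification annotation data into the pooling layer to obtain the first sample classification annotation data vector, and input the character vectors of the second sample classification annotation data into the pooling layer to obtain the second sample classification annotation data vector; calculate the absolute value of the difference between the first and second sample classification annotation data vectors to obtain the sample classification annotation data difference vector; concatenate the first sample classification annotation data vector, the second sample classification annotation data vector, and the difference vector to obtain the sample classification annotation data concatenated vector; and input the concatenated vector into the semantic classifier for training, the semantic similarity model being obtained once training is complete.
On the basis of the above embodiment, the cluster determination module 320 can be configured to: construct a semantic similarity matrix from the semantic similarity calculation results, wherein each matrix element is the semantic similarity between two pieces of text classification annotation data; obtain one unprocessed piece of text classification annotation data from the data set as the target data, and mark the target data as processed; starting from the target data, query the semantic similarity matrix and successively traverse all data in the data set that are density-connected to the target data; form a cluster from the target data together with all of its density-connected data, and mark each density-connected piece of data as processed; and return to the operation of obtaining one unprocessed piece of text classification annotation data as the target data, until all text classification annotation data have been processed.
On the basis of the above embodiment, the apparatus further includes an abnormal annotation data determination module, which can be configured to: identify isolated text classification annotation data that does not belong to any cluster as abnormal annotation data.
On the basis of the above embodiment, the abnormal classification sub-cluster identification module 340 can be configured to: count the number of pieces of text classification annotation data in the current classification sub-cluster, and calculate the proportion weight value of that number relative to the number of pieces of text classification annotation data in the cluster to which the current classification sub-cluster belongs; determine whether the proportion weight value is greater than the preset proportion weight filtering threshold and, in response to the proportion weight value being not greater than the preset threshold, identify the current classification sub-cluster as an abnormal classification sub-cluster.
On the basis of the above embodiment, the apparatus further includes an explanation label adding module, which can be configured to: after the abnormal classification sub-clusters are identified according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs, add an explanation label to each piece of text classification annotation data in an abnormal classification sub-cluster, and feed the labeled text classification annotation data back to the user.
The above anomaly detection apparatus for text classification annotation samples can execute the anomaly detection method for text classification annotation samples provided by any embodiment of the present application, and has the functional modules and effects corresponding to the executed method.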
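Embodiment 4
Figure 4 is a schematic structural diagram of a computer device provided in Embodiment 4 of the present application. As shown in Figure 4, the device includes a processor 410, a memory 420, an input device 430, and an output device 440; the device may have one or more processors 410, and Figure 4 takes one processor 410 as an example; the processor 410, the memory 420, the input device 430, and the output device 440 in the device may be connected by a bus or by other means, and Figure 4 takes connection by a bus as an example.
As a computer-readable storage medium, the memory 420 can be configured to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the anomaly detection method for text classification annotation samples in the embodiments of the present application (for example, the text classification annotation data set acquisition module 310, the cluster determination module 320, the classification sub-cluster determination module 330, and the abnormal classification sub-cluster identification module 340). The processor 410 executes the various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 420, that is, it implements the above anomaly detection method for text classification annotation samples, which includes:
obtaining a text classification annotation data set to be denoised; calculating the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, clustering the text classification annotation data in the data set to obtain at least one cluster; within each cluster, performing secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
The memory 420 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system and the application programs required by at least one function, and the data storage area can store data created through use of the terminal, and so on. In addition, the memory 420 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 420 may include memory located remotely from the processor 410, and such remote memory can be connected to the device through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 430 can be configured to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 440 may include a display device such as a display screen.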
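Embodiment 5
Embodiment 5 of the present application further provides a computer-readable storage medium containing computer-readable instructions which, when executed by a computer processor, are used to perform an anomaly detection method for text classification annotation samples. The method includes: obtaining a text classification annotation data set to be denoised; calculating the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, clustering the text classification annotation data in the data set to obtain at least one cluster; within each cluster, performing secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster; and identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which that sub-cluster belongs.
In the computer-readable storage medium provided by the embodiment of the present application, the computer-readable instructions are not limited to the method operations described above and can also perform the related operations of the anomaly detection method for text classification annotation samples provided by any embodiment of the present application.
From the above description of the implementations, it can be understood that the present application can be implemented by software together with the necessary general-purpose hardware, or it can be implemented in hardware. The technical solution of the present application can essentially be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer's floppy disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), flash memory (FLASH), hard disk, or optical disk, and includes multiple instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present application.
In the above embodiments of the anomaly detection apparatus for text classification annotation samples, the units and modules included are divided only according to functional logic, and the division is not limited to the one described as long as the corresponding functions can be realized; in addition, the specific names of the functional units are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application.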

Claims (10)

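  1. An anomaly detection method for text classification annotation samples, comprising:
    obtaining a text classification annotation data set to be denoised, wherein each piece of text classification annotation data includes a classification label;
    calculating the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, clustering the text classification annotation data in the text classification annotation data set to obtain at least one cluster;
    within each cluster, performing secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster;
    identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
  2. The method according to claim 1, wherein calculating the semantic similarity between every two pieces of text classification annotation data comprises:
    inputting every two pieces of text classification annotation data into a pre-trained semantic similarity model, and obtaining the semantic similarity between the two pieces of text classification annotation data.
  3. The method according to claim 2, further comprising, before inputting every two pieces of text classification annotation data into the pre-trained semantic similarity model:
    inputting two obtained pieces of sample classification annotation data separately into a parameter sharing layer to obtain the multiple character vectors corresponding to each of the two pieces of sample classification annotation data;
    inputting the multiple character vectors corresponding to the first sample classification annotation data into a pooling layer to obtain a first sample classification annotation data vector, and inputting the multiple character vectors corresponding to the second sample classification annotation data into the pooling layer to obtain a second sample classification annotation data vector;
    calculating the absolute value of the difference between the first sample classification annotation data vector and the second sample classification annotation data vector to obtain a sample classification annotation data difference vector;
    concatenating the first sample classification annotation data vector, the second sample classification annotation data vector, and the sample classification annotation data difference vector to obtain a sample classification annotation data concatenated vector;
    inputting the sample classification annotation data concatenated vector into a semantic classifier for training, the semantic similarity model being obtained after training is complete.
  4. The method according to claim 1, wherein clustering the text classification annotation data in the text classification annotation data set according to the semantic similarity calculation results to obtain at least one cluster comprises:
    constructing a semantic similarity matrix according to the semantic similarity calculation results, wherein each matrix element of the semantic similarity matrix is the semantic similarity between two pieces of text classification annotation data;
    obtaining, in the text classification annotation data set, one unprocessed piece of text classification annotation data as target data, and marking the state of the target data as processed;
    starting from the target data, querying the semantic similarity matrix and successively traversing all density-connected data of the target data in the text classification annotation data set;
    forming a cluster from the target data together with all of the density-connected data, and marking the state of each piece of density-connected data as processed;
    returning to the operation of obtaining, in the text classification annotation data set, one unprocessed piece of text classification annotation data as target data, until all text classification annotation data have been processed.
  5. The method according to claim 4, further comprising, after clustering the text classification annotation data in the text classification annotation data set according to the semantic similarity calculation results to obtain at least one cluster:
    identifying isolated text classification annotation data that does not belong to any cluster as abnormal annotation data.
  6. The method according to any one of claims 1-5, wherein identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which the classification sub-cluster belongs comprises:
    counting the number of pieces of text classification annotation data in a current classification sub-cluster, and calculating the proportion weight value of that number relative to the number of pieces of text classification annotation data in the cluster to which the current classification sub-cluster belongs;
    determining whether the proportion weight value is greater than a preset proportion weight filtering threshold and, in response to the proportion weight value being not greater than the preset proportion weight filtering threshold, identifying the current classification sub-cluster as the abnormal classification sub-cluster.
  7. The method according to any one of claims 1-5, further comprising, after identifying abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which the classification sub-cluster belongs:
    adding an explanation label to each piece of text classification annotation data in the abnormal classification sub-cluster, and feeding the text classification annotation data with the explanation label added back to the user.
  8. An anomaly detection apparatus for text classification annotation samples, comprising:
    a text classification annotation data set acquisition module, configured to obtain a text classification annotation data set to be denoised, wherein each piece of text classification annotation data includes a classification label;
    a cluster determination module, configured to calculate the semantic similarity between every two pieces of text classification annotation data and, according to the semantic similarity calculation results, cluster the text classification annotation data in the text classification annotation data set to obtain at least one cluster;
    a classification sub-cluster determination module, configured to perform, within each cluster, secondary clustering on the multiple pieces of text classification annotation data that share the same classification label to obtain the classification sub-clusters corresponding to each cluster;
    an abnormal classification sub-cluster identification module, configured to identify abnormal classification sub-clusters according to the proportion of the text classification annotation data in a classification sub-cluster relative to the text classification annotation data in the cluster to which the classification sub-cluster belongs.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the anomaly detection method for text classification annotation samples according to any one of claims 1-7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the anomaly detection method for text classification annotation samples according to any one of claims 1-7.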
PCT/CN2022/118488 2022-06-28 2022-09-13 Anomaly detection method, apparatus, device, and medium for text classification annotation samples WO2024000822A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210749204.XA CN115098679A (zh) 2022-06-28 2022-06-28 Anomaly detection method, apparatus, device, and medium for text classification annotation samples
CN202210749204.X 2022-06-28

Publications (1)

Publication Number Publication Date
WO2024000822A1 true WO2024000822A1 (zh) 2024-01-04

Family

ID=83295445

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118488 WO2024000822A1 (zh) 2022-06-28 2022-09-13 Anomaly detection method, apparatus, device, and medium for text classification annotation samples

Country Status (2)

Country Link
CN (1) CN115098679A (zh)
WO (1) WO2024000822A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
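CN118296147A (zh) * 2024-04-17 2024-07-05 江苏物润船联网络股份有限公司 Knowledge-base-based intelligent matching method and system for bullet-comment response text
CN118412080A (zh) * 2024-07-02 2024-07-30 山东第一医科大学附属省立医院(山东省立医院) Pediatric clinical data collection method, system, and device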

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
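CN116541731B (zh) * 2023-05-26 2024-07-23 北京百度网讯科技有限公司 Method, apparatus, and device for processing network behavior data
CN116757807B (zh) * 2023-08-14 2023-11-14 湖南华菱电子商务有限公司 Intelligent assisted bid evaluation method based on optical character recognition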

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020161763A1 (en) * 2000-10-27 2002-10-31 Nong Ye Method for classifying data using clustering and classification algorithm supervised
CN110928862A (zh) * 2019-10-23 2020-03-27 深圳市华讯方舟太赫兹科技有限公司 Data cleaning method, data cleaning device, and computer storage medium
CN114398350A (zh) * 2021-12-30 2022-04-26 以萨技术股份有限公司 Method, apparatus, and server for cleaning a training data set

Also Published As

Publication number Publication date
CN115098679A (zh) 2022-09-23

Similar Documents

Publication Publication Date Title
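WO2024000822A1 (zh) Anomaly detection method, apparatus, device, and medium for text classification annotation samples
WO2022126810A1 (zh) Text clustering method
CN109981625B (zh) Log template extraction method based on online hierarchical clustering
CN111309912A (zh) Text classification method and apparatus, computer device, and storage medium
WO2019080411A1 (zh) Electronic apparatus, face image clustering search method, and computer-readable storage medium
CN110795919A (zh) Method, apparatus, device, and medium for extracting tables from PDF documents
CN112163424A (zh) Data annotation method, apparatus, device, and medium
CN110083832B (zh) Method, apparatus, device, and readable storage medium for identifying article reprint relationships
CN112036514B (zh) Image classification method, apparatus, server, and computer-readable storage medium
CN109284369B (zh) Method, system, apparatus, and medium for determining the importance of securities news
CN110704603A (zh) Method and apparatus for discovering current hot events from news information
CN111738341B (zh) Distributed large-scale face clustering method and apparatus
CN111046282B (zh) Text label setting method, apparatus, medium, and electronic device
CN108959329A (zh) Text classification method, apparatus, medium, and device
CN113158777B (zh) Quality scoring method, training method for a quality scoring model, and related apparatus
CN113762377A (zh) Network traffic identification method, apparatus, device, and storage medium
CN112364163A (zh) Log caching method, apparatus, and computer device
CN116841779A (zh) Abnormal log detection method, apparatus, electronic device, and readable storage medium
CN116029280A (zh) Document key information extraction method, apparatus, computing device, and storage medium
CN111738290B (zh) Image detection method, model construction and training method, apparatus, device, and medium
CN113283396A (zh) Method, apparatus, computer device, and storage medium for detecting the category of a target object
CN116843677A (zh) Appearance quality inspection system and method for sheet metal parts
CN111597336A (zh) Training text processing method, apparatus, electronic device, and readable storage medium
CN111314109A (zh) Weak-key-based firmware identification method for large-scale Internet of Things devices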
US12094232B2 (en) Automatically determining table locations and table cell types

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22948911

Country of ref document: EP

Kind code of ref document: A1