CN115204318A - Event automatic hierarchical classification method and electronic equipment

Event automatic hierarchical classification method and electronic equipment

Info

Publication number
CN115204318A
CN115204318A (application CN202211118551.9A)
Authority
CN
China
Prior art keywords
data
classified
semantic representation
classification
representation vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211118551.9A
Other languages
Chinese (zh)
Other versions
CN115204318B (en)
Inventor
朵思惟
余梓飞
张程华
张艳丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Huizhi Xingyuan Information Technology Co ltd
Original Assignee
Tianjin Huizhi Xingyuan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Huizhi Xingyuan Information Technology Co ltd filed Critical Tianjin Huizhi Xingyuan Information Technology Co ltd
Priority to CN202211118551.9A
Publication of CN115204318A
Application granted
Publication of CN115204318B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to the automatic event hierarchical classification method and the electronic device provided herein, data to be classified is acquired from a network, and an unsupervised learning method is used to run multiple rounds of iteration over the classification process of that data. In response to an iteration cutoff condition being met, annotation data corresponding to the data to be classified is output, carrying at least one hierarchy label and a number of categories, where the hierarchy label characterizes the classification level of the annotation data. The method performs unsupervised hierarchical classification of events and improves the accuracy of the unsupervised classification learning method by combining it with a clustering algorithm, achieving accurate classification of event hierarchies and categories without manual labeling and thereby reducing the labor and time costs of large-scale manual annotation. The method can be used in different fields and scenarios and has good migratability.

Description

Event automatic hierarchical classification method and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to an event automatic hierarchical classification method and an electronic device.
Background
With the assistance of information technologies such as big data platforms and artificial intelligence, a large number of social governance events are generated every day. At present, cause analysis of social governance events mainly follows one of two modes: supervised learning or unsupervised learning. Supervised learning is accurate, but a large amount of manually labeled data is needed to train a deep learning model to good effect; the resulting model serves only a single scenario, and switching scenarios requires relabeling the data, so its migratability is poor. Unsupervised learning requires no labeled data, but its accuracy is low, which limits its reference value in practical scenarios.
Disclosure of Invention
In view of the above, an object of the present application is to provide an event automatic hierarchical classification method, apparatus, electronic device, and storage medium, so as to solve the problems that supervised learning requires a large amount of manual labeling and unsupervised learning suffers from low accuracy.
A first aspect of the present application provides an event automatic hierarchical classification method, including:
acquiring data to be classified;
performing multiple rounds of iteration on the classification process of the data to be classified by adopting an unsupervised learning method;
in response to an iteration cutoff condition being met, outputting annotation data which corresponds to the data to be classified, comprises at least one hierarchy label, and falls into a plurality of categories, wherein the hierarchy label characterizes the classification level of the annotation data.
The second aspect of the present application also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable by the processor, the processor implementing the method as described above when executing the computer program.
As can be seen from the above, in the event automatic hierarchical classification method and electronic device provided by the present application, the data to be classified is acquired from a network, and the classification process is iterated over multiple rounds using an unsupervised learning method. In response to an iteration cutoff condition being met, annotation data corresponding to the data to be classified is output, carrying at least one hierarchy label and a number of categories, where the hierarchy label characterizes the classification level of the annotation data. The method classifies events hierarchically without supervision and improves the accuracy of the unsupervised classification learning method by combining it with a clustering algorithm, so that event categories are classified accurately without manual labeling, reducing the labor and time costs of large-scale manual annotation. Through multiple rounds of iteration, hierarchical classification of events is achieved and the classification effect is improved. Moreover, the method can be used in different fields and scenarios and has good migratability.
Drawings
In order to illustrate the technical solutions in the present application or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1a is a schematic flow chart illustrating an event automatic hierarchical classification method according to an embodiment of the present application;
FIG. 1b is a schematic diagram of hierarchical classification labels for multiple iterations according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a multi-iteration loop according to an embodiment of the present application;
FIG. 3a is a diagram illustrating an embedding effect of a first semantic representation vector according to an embodiment of the present application;
FIG. 3b is a diagram of an embedding effect of a second semantic representation vector according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating a process of performing multiple rounds of iterative classification on data to be classified according to an embodiment of the present application;
FIG. 5 is a diagram illustrating an embedded model of an overlay classification layer according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an apparatus structure of an event automatic hierarchical classification method according to an embodiment of the present application;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the Background, with the assistance of information technologies such as big data platforms and artificial intelligence, a large number of social governance events are generated every day. How to quickly and accurately mine homologous events sharing the same cause from these events, so as to resolve social contradictions and problems at their root, is an urgent question in the current social governance field. At present, cause analysis of social governance events mainly follows two modes: supervised learning and unsupervised learning. Supervised learning is the machine learning task of inferring a function from a set of labeled training data consisting of training examples. It is accurate, but a large amount of manually labeled data is needed to train a deep learning model to good effect, and the model migrates poorly. Solving pattern recognition problems from unlabeled training samples is referred to as unsupervised learning. It requires no labeled data, but its accuracy is low and its effect when applied to practical scenarios is correspondingly poor.
In view of the above, the present application provides an automatic event hierarchical classification method that integrates a dimensionality reduction algorithm and a density clustering algorithm into an iterative process. On the basis of classifying events accurately, it requires no manual data labeling, which greatly reduces labor cost, and it can be used in different fields and scenarios, such as homologous-event analysis in social governance and intelligent question answering in the legal and medical fields, with good migratability.
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
The application provides an event automatic hierarchical classification method, referring to fig. 1a, comprising the following steps:
Step 102, acquiring data to be classified. The data to be classified refers to a large number of unclassified events in the same field acquired from a network, for example collected from a big data platform; the field may be social governance, law, medicine, and so on. If the events to be classified belong to the social governance field, their source may be information reported by community grid workers, covering common social problems such as disputes, complaints, compensation, and the environment.
It should be noted that, before classifying the acquired data to be classified, preprocessing is required. For example, for unstructured events, preprocessing may include removing duplicate data, removing rows containing empty strings, unifying data formats, and so on. Those skilled in the art can adopt a preprocessing mode suited to the actual characteristics of the data to be classified and the classification requirements; no specific limitation is imposed here.
Step 104, performing multiple rounds of iteration on the classification process of the data to be classified by adopting an unsupervised learning method.
It should be noted that sentence embeddings obtained from a sentence embedding model based on a pre-trained language model exhibit low spatial discrimination between sentences with different semantics; as a result, clustering such embeddings directly yields extremely low accuracy, and the clustering result cannot be applied in practice.
To solve this problem, in this embodiment an embedding model is first used to embed the data to be classified into an embedded representation, and a clustering algorithm then clusters that representation into pseudo-labeled data of multiple categories. Using an unsupervised learning method, the embedding model is fine-tuned on the multi-category pseudo-labeled data, which improves both its embedding performance and its spatial embedding discrimination. This process requires no large-scale manual labeling, reducing annotation cost. Finally, the new embedded representation is clustered by the clustering model to obtain labeled data of multiple categories. This constitutes one round of the multi-round iterative process; after several such rounds, accurate hierarchical classification of the data to be classified is achieved.
The clustering algorithm yields labeled data of multiple categories, where each category corresponds to one cluster and all data in the cluster belong to that category. Too many or too few categories will degrade the clustering result; the number of categories can be tuned by changing the parameters of the clustering algorithm until a more suitable number is obtained.
It should be noted that the embedding model in this embodiment is a sentence embedding model; compared with a word embedding model, a sentence embedding model better captures the sentence semantics of the data to be classified.
The clustering algorithm may be Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), or a clustering algorithm that requires a given number of classes, such as the K-Means clustering algorithm, K-Means++, the kernel K-Means clustering algorithm, and the like.
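By way of a non-limiting sketch, the candidate algorithms named above can be instantiated in Python as follows (scikit-learn for DBSCAN and K-Means, the third-party hdbscan package for HDBSCAN; all parameter values are illustrative assumptions, not values fixed by this application):

```python
# Illustrative instantiation of the candidate clustering algorithms.
# Parameter values are assumptions for demonstration only.
from sklearn.cluster import DBSCAN, KMeans
import hdbscan

density = DBSCAN(eps=0.5, min_samples=5)             # density-based, no K needed
hier_density = hdbscan.HDBSCAN(min_cluster_size=10)  # hierarchical density-based
kmeans = KMeans(n_clusters=8)                        # requires a given number of classes K
```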
Step 106, in response to an iteration cutoff condition being met, outputting the annotation data which corresponds to the data to be classified, comprises at least one hierarchy label, and has a number of categories, where the hierarchy label characterizes the classification level of the annotation data.
Specifically, the iteration cutoff condition may be a preset number of iterations, a minimum loss, or the like; this embodiment does not limit it specifically, and a matching cutoff condition may be set according to actual iteration requirements. After multiple iterations of step 104, annotation data with several categories and at least one hierarchy label is obtained. Illustratively, hierarchy label kinds may include primary labels, secondary labels, tertiary labels, and so on, and each level comprises multiple categories: for example, the primary level may include categories A, B, and C; the secondary level categories D, E, and F; and the tertiary level categories G, H, and I. Each primary label may or may not include at least one secondary label, and each secondary label may or may not include at least one tertiary label. The acquisition of hierarchy labels and categories is described below in conjunction with the iterative process:
referring to fig. 1b, the hierarchical label and several categories are similar to a tree structure after multiple iterations. Specifically, after a first iteration, a primary label including 3 categories is obtained, and the 3 categories are an a category, a B category and a C category respectively. After the second iteration, obtaining secondary labels by a category A and a category B in the 3 categories, wherein the secondary labels of the category A comprise two categories, namely a category D and a category E; the secondary label of the B category comprises a category which is an F category; category C does not get a secondary label. After the third iteration, the D category obtains three-level labels comprising two categories, namely a G category and an H category; the category E obtains a three-level label containing one category, and the three-level label is the category I; category F does not result in a tertiary label.
As can be seen from the above, after several iterations the data to be classified yields labeled data with a number of categories and at least one hierarchy label, where the hierarchy labels comprise primary, secondary, and tertiary labels, and the categories comprise the aforementioned categories A through I.
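For illustration only, the hierarchy of fig. 1b can be pictured as a nested mapping (the category names are the placeholders used above):

```python
# The Fig. 1b label tree as a nested dict; empty dicts are categories
# that obtained no further sub-labels.
label_tree = {
    "A": {"D": {"G": {}, "H": {}},   # A -> secondary D -> tertiary G, H
          "E": {"I": {}}},           # A -> secondary E -> tertiary I
    "B": {"F": {}},                  # B -> secondary F (no tertiary label)
    "C": {},                         # C obtained no secondary label
}
```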
In some embodiments, the classification process of the data to be classified is iterated over multiple rounds using an unsupervised learning method; referring to fig. 2, this includes the following steps:
for each iteration the following operations are performed:
step 202, inputting the data to be classified into a preset embedding model, and outputting a first semantic representation vector through the embedding model. For example, the embedding model in this embodiment may be an unsupervised chinese Sentence embedding model, such as a sequence-Bert, bert-flow, or Bert-Whitening embedding model, and other types of embedding models may be selected according to actual embedding requirements, which is not limited herein. And outputting a first semantic representation vector through the embedding model, wherein the first semantic representation vector is an embedded representation of the statement in the data to be classified.
Step 204, clustering the first semantic representation vector through a first clustering algorithm to obtain pseudo-labeled data of multiple categories.
The first clustering algorithm in this embodiment is a density clustering algorithm. Because the data to be classified carries no labels, its categories cannot be read off directly; a density clustering algorithm is therefore used to extract high-quality data from it, and it also yields a preliminary set of categories and their number. The multiple categories may be K categories, with K an integer greater than zero. However, at this point the embedding model still has low spatial embedding discrimination for the data to be classified, so the clustering effect is not yet ideal.
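A sketch of step 204 under the assumption that HDBSCAN serves as the first (density) clustering algorithm; hyperparameters are illustrative, and first_vectors_low stands for the dimension-reduced first semantic representation vectors (the UMAP reduction is sketched later in this description):

```python
# Hedged sketch of step 204: density clustering yields K preliminary
# categories as pseudo labels; points labeled -1 are noise.
import hdbscan

first_clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=5)
pseudo_labels = first_clusterer.fit_predict(first_vectors_low)
num_categories = int(pseudo_labels.max()) + 1  # K, the preliminary number of categories
```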
Step 206, adjusting parameters of the embedding model through the pseudo-labeled data of the multiple categories.
Specifically, after the data to be classified is clustered by the first clustering algorithm, pseudo-labeled data of multiple categories is obtained. Fine-tuning the embedding model on this pseudo-labeled data updates its parameters and improves its spatial embedding discrimination when embedding sentences, so that the distance between semantically similar sentences shrinks and the distance between semantically different sentences grows.
Step 208, inputting the data to be classified into the adjusted embedding model, and outputting a second semantic representation vector via the adjusted embedding model. After the adjusted embedding model is obtained, the data to be classified is embedded again through it to obtain the second semantic representation vector; compared with the first, second semantic representation vectors of sentences with similar meanings lie close together, while those of sentences with different meanings lie far apart. Fig. 3a shows the embedding effect of the first semantic representation vector and fig. 3b that of the second; the numbers in both figures are embedded sentence numbers. The first semantic representation vectors in fig. 3a clump together and their post-embedding spatial discrimination is not high, while the spatial discrimination in fig. 3b is markedly better: second semantic representation vectors of similar sentences aggregate more tightly, and those of different sentences disperse effectively across the space, verifying that the adjusted embedding model embeds noticeably better.
Step 210, clustering the second semantic representation vector through a second clustering algorithm to obtain labeled data of multiple categories belonging to one kind of hierarchy label, where the hierarchy label to which the labeled data obtained in each iteration belongs differs from round to round.
The second clustering algorithm in this step comprises a density clustering algorithm and a clustering algorithm requiring a given number of categories; the former denoises the data to be classified, while the latter clusters it based on the K categories obtained by the first clustering algorithm, yielding more accurate data in each category and improving per-category classification quality. Within the current iteration, the obtained labeled data all belong to the same hierarchy label, for example the primary label; continued iteration then yields the secondary label, the tertiary label, and so on in turn.
Step 212, taking the labeled data of the multiple categories in the current round as the data to be classified in the next iteration. If the labeled data obtained in the current round carries the primary label, the labeled data corresponding to each primary label must be clustered again to check whether subdividable secondary-label categories exist under that primary label. Continued iterative clustering finally yields labeled data at multiple levels, completing the hierarchical classification of the data to be classified.
In some embodiments, the iteration cutoff condition comprises:
and after the first semantic representation vector is clustered through the first clustering algorithm, the pseudo labeling data of the multiple categories do not exist.
That is, after several rounds of iteration, re-clustering the data to be classified produces no new clusters; no new categories emerge beyond the existing ones. Classification is then complete, no further subdivision is needed, the iterative process stops, and the labeled data of the several categories is obtained. After the iteration stops, the remaining unlabeled data is automatically labeled as the "other" category.
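A one-line check captures this cutoff, assuming the HDBSCAN convention that label -1 marks noise (a sketch, not the application's literal test; first_clusterer is the hypothetical clusterer from the earlier sketch):

```python
# Sketch of the iteration cutoff: if density clustering finds no cluster
# at all, every point is noise (-1), no new category emerged, so stop.
labels = first_clusterer.fit_predict(vectors)
if (labels == -1).all():
    remaining_label = "other"  # leftover unlabeled data is tagged "other"
```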
In some embodiments, adjusting the parameters of the embedded model by the plurality of classes of pseudo-annotation data comprises:
extracting the pseudo-labeled data corresponding to the high-density points of each of the multiple categories that meet a preset condition; and adjusting the parameters of the embedding model through the pseudo-labeled data corresponding to all high-density points meeting the preset condition.
Because the parameters of the embedding model have not yet been updated and adjusted, the clusters produced by the first clustering algorithm are of limited quality. To obtain high-quality training data, this embodiment extracts the high-density points in each high-density region found by the first clustering algorithm. A high-density point has higher confidence within its cluster than other points; it can be understood as data that definitely belongs to that cluster, i.e. that category. Adjusting the parameters of the embedding model with these high-density points improves the spatial discrimination of the embedding vectors the model outputs, providing a basis for the subsequent second clustering to produce accurately labeled data. The preset condition on a high-density point has two parts. First, the persistence value of the cluster containing the point must exceed a given threshold; persistence characterizes cluster quality: data in a high-quality cluster essentially belongs to one category, while data in a low-quality cluster is dispersed and mixed across categories. Second, the probability of the point belonging to its cluster must exceed a given threshold; this probability indicates the confidence of the point within its cluster. In this embodiment, the persistence threshold is 0.1 (persistence ranges over 0~1) and the confidence threshold is 0.8 (confidence ranges over 0~1). If no high-density points meet the preset condition, the data contains no high-quality same-category data; if the condition is met, the clusters meeting the persistence threshold, and the high-density point data within them meeting the confidence threshold, are output.
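Assuming HDBSCAN again, whose cluster_persistence_ and probabilities_ attributes expose exactly these two quantities, the filter can be sketched as follows (the thresholds 0.1 and 0.8 are the ones given above; variable names carry over from the earlier sketches):

```python
# Sketch: keep only pseudo-labeled points in persistent clusters (>0.1)
# whose membership probability exceeds the confidence threshold (>0.8).
import numpy as np

PERSISTENCE_THRESHOLD = 0.1
CONFIDENCE_THRESHOLD = 0.8

keep = np.zeros(pseudo_labels.shape, dtype=bool)
for cluster_id, persistence in enumerate(first_clusterer.cluster_persistence_):
    if persistence > PERSISTENCE_THRESHOLD:
        keep |= (pseudo_labels == cluster_id) & \
                (first_clusterer.probabilities_ > CONFIDENCE_THRESHOLD)

high_density_vectors = first_vectors_low[keep]  # training data for fine-tuning
high_density_labels = pseudo_labels[keep]
```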
In some embodiments, adjusting the parameters of the embedded model by the plurality of classes of pseudo-annotation data further comprises: and superposing a classification layer behind the embedded model, and training the embedded model superposed with the classification layer through the pseudo-labeled data of the multiple categories.
In some embodiments, inputting the data to be classified into the adjusted embedding model, and outputting a second semantic representation vector via the adjusted embedding model, comprises:
and removing the classification layer of the embedding model of the overlapped classification layer, and inputting the data to be classified into the embedding model of the overlapped classification layer with the classification layer removed to obtain the second semantic representation vector.
Referring to fig. 5, when adjusting the parameters of the embedding model, this embodiment applies a "dismantle the bridge after crossing the river" approach to improve on the conventional sentence embedding model: an embedding model with a superimposed classification layer is trained, but neither the classification result of that model nor all of its network weights are used directly. Only its "by-product" is used, namely the first several network layers beneath the superimposed classification layer (the part inside the dashed box in fig. 5), which generate the sentence embeddings. The classification layer superimposed after the embedding model serves solely to correct the direction in which the model embeds sentences, shrinking the distance between semantically similar sentences and enlarging the distance between semantically different ones. After training completes, the superimposed classification layer is removed and only the part inside the dashed box of fig. 5 is kept, yielding the fine-tuned embedding model, through which the data to be classified is embedded into the second semantic representation vector. This embodiment does not limit the type of the classification layer: any existing classification layer that achieves a classification effect can be used, and those skilled in the art can select a suitable one to superimpose on the sentence embedding model according to actual needs.
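A minimal PyTorch sketch of this scaffolding, assuming the encoder maps its inputs to a fixed-size sentence vector (all names and shapes are illustrative, not this application's literal network):

```python
# Sketch of Fig. 5: a temporary classification layer is stacked on the
# sentence encoder, trained on K-class pseudo labels, then discarded.
import torch.nn as nn

class EncoderWithHead(nn.Module):
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder                         # the dashed-box part of Fig. 5
        self.head = nn.Linear(embed_dim, num_classes)  # superimposed classification layer

    def forward(self, x):
        return self.head(self.encoder(x))  # logits for nn.CrossEntropyLoss

# Train on the high-density pseudo-labeled data, then "dismantle the
# bridge": keep only the tuned encoder and discard the head.
# tuned_encoder = model.encoder
```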
In some embodiments, the second clustering algorithm includes a first sub-clustering algorithm and a second sub-clustering algorithm, and clustering the second semantic representation vector by the second clustering algorithm to obtain labeling data belonging to multiple categories of one of the hierarchical labels includes:
clustering the second semantic representation vector through the first sub-clustering algorithm to remove noise data in the second semantic representation vector;
and clustering the second semantic representation vectors with the noise data removed by the second sub-clustering algorithm to obtain the labeling data of the multiple categories belonging to one hierarchical label.
It should be noted that the choice of clustering algorithm must satisfy three conditions: (a) the algorithm can cluster according to a given number of categories; (b) the algorithm can identify noise points; (c) the algorithm can cluster non-convex data sets. The traditional K-means algorithm satisfies condition (a) but not (b) and (c); kernel K-means satisfies (a) and (c) but not (b); and density-based clustering algorithms (such as DBSCAN and HDBSCAN) satisfy (b) and (c) but not (a). This embodiment therefore combines the advantages of kernel K-means and density-based clustering: a density-based algorithm that handles noise first identifies the noise points and labels them "other", and kernel K-means then clusters the remaining data. On one hand this avoids the defect of directly using K-means and its derivatives, which force all unknown classes into known classes; on the other hand it makes good use of the known number of categories, improving the clustering effect.
In this embodiment, the first sub-clustering algorithm is a density clustering algorithm, specifically the HDBSCAN algorithm, which removes the noise data. The second sub-clustering algorithm is the kernel K-means algorithm. In principle, the traditional K-means algorithm performs convex optimization and therefore cannot handle non-convexly distributed data, so applying it directly clusters poorly; the kernel K-means clustering algorithm introduces a kernel function to optimize the algorithm, handles non-convexly distributed data well, and clusters well.
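Since scikit-learn ships no kernel K-means, the sketch below implements a minimal one via the kernel trick (an RBF kernel is an assumption; the application names only "kernel K-means"), preceded by the HDBSCAN denoising step. second_vectors_low and num_categories are the hypothetical names from the earlier sketches:

```python
# Sketch of step 210's two-stage clustering: HDBSCAN removes noise,
# then a minimal kernel K-means clusters the denoised vectors.
import numpy as np
import hdbscan
from sklearn.metrics.pairwise import pairwise_kernels

def kernel_kmeans(X, k, n_iter=50, gamma=1.0, seed=0):
    """Minimal kernel K-means: distances to cluster centroids are
    computed in the implicit feature space via the kernel trick."""
    K = pairwise_kernels(X, metric="rbf", gamma=gamma)
    labels = np.random.default_rng(seed).integers(0, k, size=len(X))
    for _ in range(n_iter):
        dist = np.full((len(X), k), np.inf)
        for c in range(k):
            mask = labels == c
            n_c = mask.sum()
            if n_c == 0:
                continue  # empty cluster stays at +inf, never chosen
            # ||phi(x)-mu_c||^2 = K(x,x) - 2*mean_j K(x,j) + mean_ij K(i,j)
            dist[:, c] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / n_c
                          + K[np.ix_(mask, mask)].sum() / n_c ** 2)
        new_labels = dist.argmin(axis=1)
        if (new_labels == labels).all():
            break  # assignments stable: converged
        labels = new_labels
    return labels

noise = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(second_vectors_low) == -1
category_labels = kernel_kmeans(second_vectors_low[~noise], k=num_categories)
```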
Note that the noise data removed here can also serve as data to be classified in the next iteration. In the current round, data with high confidence in each category is obtained preferentially, while data with low confidence may be labeled as noise; clustering that noise data again in the next round extracts whatever higher-confidence data it contains, avoiding incomplete or missed classification of the data to be classified.
The second sub-clustering algorithm is then applied to the denoised second semantic representation vector, yielding the current round's multi-category labeled data, which belongs to the same hierarchy label and has higher accuracy.
In some embodiments, the using the labeling data of the multiple classes in the current round as the data to be classified in the next iteration includes: and taking the labeling data of each category of the multiple categories as data to be classified of the next iteration respectively.
After the current round obtains labeled data of multiple categories, the data within each category needs to be classified again, so the labeled data of each category is treated as data to be labeled and enters the next round of cluster classification. For example, suppose the current iteration yields category A and category B. The data contained in categories A and B is then taken, respectively, as data to be classified for the next round of cluster classification: clustering category A yields category C, and clustering category B yields categories D and E. In practice, the data of categories A and B may be fed into the next round one after the other or at the same time. If any category can no longer be subdivided, the iteration loop is exited and the whole iterative process ends. Using the labeled data of each category in one round as the data to be classified in the next enables hierarchical classification of the data and improves the granularity and clarity of the classification, as shown in the recursive sketch below.
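The per-category descent can be sketched recursively; one_iteration is a hypothetical helper standing for one full round of the pipeline (a sketch of it follows the fig. 4 walkthrough below), returning a mapping from category name to the data assigned to it, or an empty mapping when no subdivision is found:

```python
# Sketch of step 212: each category's labeled data re-enters the
# pipeline as data to be classified for the next, deeper round.
def hierarchical_classify(data):
    categories = one_iteration(data)  # hypothetical helper: one full round
    return {name: hierarchical_classify(subset)
            for name, subset in categories.items()}
```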
In some embodiments, said clustering said first semantic representation vector by a first clustering algorithm comprises:
performing dimensionality reduction processing on the first semantic representation vector by adopting a dimensionality reduction algorithm to obtain a low-dimensionality first semantic representation vector;
prior to said clustering said second semantic representation vector by a second clustering algorithm, comprising:
and performing dimensionality reduction processing on the second semantic representation vector by adopting a dimensionality reduction algorithm to obtain a low-dimensionality second semantic representation vector.
Both the first semantic representation vector obtained from the embedding model and the second obtained from the updated embedding model are high-dimensional sentence embedding matrices of size M × N, where M is the embedding dimension of a single sentence and depends on the sentence embedding model used: with Sentence-BERT, M = 512; with BERT-Chinese, M = 768. Owing to the sparsity and nearest-neighbor behavior of high-dimensional space, no data clusters form there, so clustering high-dimensional data directly generally works poorly and the data must first be reduced in dimension. This embodiment uses the Uniform Manifold Approximation and Projection (UMAP) dimension reduction algorithm to reduce the obtained high-dimensional vectors to a low-dimensional embedding matrix. The algorithm first learns the manifold structure in the high-dimensional space and then projects it to a low-dimensional space; it is as fast as Principal Component Analysis (PCA) while retaining as much of the data's information as possible, including more complete global structure of the whole manifold.
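A hedged sketch using the umap-learn package (n_components and metric are illustrative choices, not values fixed by this application); this also defines the first_vectors_low name used in the earlier sketches:

```python
# Sketch: reduce the M-dimensional sentence embeddings to a
# low-dimensional matrix before clustering.
import umap

reducer = umap.UMAP(n_components=5, metric="cosine")
first_vectors_low = reducer.fit_transform(first_vectors)  # (N, M) -> (N, 5)
```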
It should be noted that the embodiments of the present application can be further described in the following ways:
Fig. 4 shows the process of performing multiple rounds of iterative classification on the data to be classified. As shown in fig. 4, data to be classified 401 is first acquired and input into a Chinese sentence embedding model 402 to obtain a high-dimensional semantic representation vector I 403; dimension reduction 404 turns this into a low-dimensional semantic representation vector I 405, which is clustered by a density clustering algorithm 406. Whether a high-density region exists is then judged 407: if so, pseudo-label data 408 of K categories is obtained; if not, the loop is exited.
The sentence embedding model with a superimposed classification layer (i.e. classification model 409) is trained on the K categories of pseudo-label data 408, producing the improved sentence embedding model 410. The data to be classified 401 is embedded through the improved model 410 into a high-dimensional semantic representation vector II 411, which dimension reduction 412 turns into a low-dimensional semantic representation vector II 413. Vector II 413 is clustered by a density clustering algorithm 414 to remove noise-point data 415, and the denoised vector II 413 is clustered by the kernel K-means clustering algorithm 416 to obtain labeled data 417 of K categories, which serves as the data to be classified for the next round of iteration.
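Pulling the sketches above together, one round of fig. 4 might read as follows (fine_tune_with_head and group_by_label are hypothetical helpers not defined by this application; the numbers in comments refer to fig. 4's reference numerals):

```python
# Hedged end-to-end sketch of one iteration of Fig. 4.
def one_iteration(events):
    v1 = embedder.encode(events)                                  # 402 -> 403
    v1_low = umap.UMAP(n_components=5).fit_transform(v1)          # 404 -> 405
    clusterer = hdbscan.HDBSCAN(min_cluster_size=10).fit(v1_low)  # 406
    if clusterer.labels_.max() < 0:                               # 407: no high-density region
        return {}                                                 # exit the loop
    k = int(clusterer.labels_.max()) + 1                          # 408: K pseudo categories
    tuned = fine_tune_with_head(embedder, events, clusterer)      # 409 -> 410 (hypothetical)
    v2 = tuned.encode(events)                                     # 411
    v2_low = umap.UMAP(n_components=5).fit_transform(v2)          # 412 -> 413
    noise = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(v2_low) == -1  # 414 -> 415
    labels = kernel_kmeans(v2_low[~noise], k)                     # 416 -> 417
    return group_by_label(events, noise, labels)                  # hypothetical grouping
```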
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the multiple devices may only perform one or more steps of the method of the embodiment, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The application also provides an automatic hierarchical event classification device.
Referring to fig. 6, the event automatic hierarchical classification apparatus includes:
an obtaining module 602 configured to obtain data to be classified;
an iteration module 604 configured to perform multiple iterations of the classification process of the data to be classified by using an unsupervised learning method;
an output module 606 configured to output, in response to an iteration cutoff condition being satisfied, annotation data corresponding to the data to be classified, the annotation data including at least one hierarchical label and having a number of categories, the hierarchical label being used for characterizing a classification hierarchy of the annotation data.
In some embodiments, the iteration module 604 is further configured to perform the following for each iteration:
inputting the data to be classified into a preset embedding model, and outputting a first semantic representation vector through the embedding model;
clustering the first semantic representation vectors through a first clustering algorithm to obtain pseudo-labeling data of multiple categories;
adjusting parameters of the embedded model through the pseudo labeling data of the plurality of categories;
inputting the data to be classified into an adjusted embedding model, and outputting a second semantic representation vector through the adjusted embedding model;
clustering the second semantic representation vector through a second clustering algorithm to obtain labeling data of multiple categories belonging to one type of the hierarchy label, wherein the hierarchy labels to which the labeling data obtained by each iteration belong are different;
and taking the labeling data of the multiple categories of the current round as the data to be classified of the next iteration.
In some embodiments, the iteration cutoff condition comprises:
and after the first semantic representation vector is clustered through the first clustering algorithm, the pseudo labeling data of the multiple categories do not exist.
In some embodiments, the iteration module 604 is further configured to extract pseudo-labeled data corresponding to the high-density points meeting a preset condition for each of the plurality of categories;
and adjusting the parameters of the embedded model through the pseudo-labeled data corresponding to all the high-density points meeting the preset conditions.
In some embodiments, the iteration module 604 is further configured to overlay a classification layer after the embedded model, and train the embedded model of the overlay classification layer with the pseudo-label data of the plurality of classes.
In some embodiments, the iteration module 604 is further configured to remove the classification layer from the embedding model with the superimposed classification layer, and input the data to be classified into that embedding model with the classification layer removed, so as to obtain the second semantic representation vector.
In some embodiments, the second clustering algorithm comprises a first sub-clustering algorithm and a second sub-clustering algorithm, and the iterative module 604 is further configured to cluster the second semantic representation vector by the first sub-clustering algorithm to remove noise data in the second semantic representation vector;
and clustering the second semantic representation vectors with the noise data removed by a second sub-clustering algorithm to obtain the labeling data of the multiple categories belonging to one hierarchical label.
In some embodiments, the iteration module 604 is further configured to use the labeling data of each of the plurality of classes as the data to be classified in the next iteration.
In some embodiments, before the clustering of the first semantic representation vector through the first clustering algorithm, the iteration module 604 is further configured to apply dimension reduction processing to the first semantic representation vector using a dimension reduction algorithm to obtain a low-dimensional first semantic representation vector;
prior to said clustering said second semantic representation vector by a second clustering algorithm, comprising:
and performing dimensionality reduction processing on the second semantic representation vector by adopting a dimensionality reduction algorithm to obtain a low-dimensionality second semantic representation vector.
For convenience of description, the above apparatus is described as being divided into various modules by function, which are described separately. Of course, when implementing the present application, the functionality of the various modules may be implemented in one or more pieces of software and/or hardware.
The apparatus of the foregoing embodiment is used to implement the corresponding event automatic hierarchical classification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
The application further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for automatically classifying events according to any of the above embodiments is implemented.
Fig. 7 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static Memory device, a dynamic Memory device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called by the processor 1010 for execution.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (for example, USB, network cable, etc.), and can also realize communication in a wireless mode (for example, mobile network, WIFI, bluetooth, etc.).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding event automatic hierarchical classification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
The present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method for automatic hierarchical classification of events according to any of the above embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the method for automatically classifying the event hierarchy according to any one of the above embodiments, and have the beneficial effects of corresponding method embodiments, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications, and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (10)

1. An event automatic hierarchical classification method is characterized by comprising the following steps:
acquiring data to be classified;
performing multiple rounds of iteration on the classification process of the data to be classified by adopting an unsupervised learning method;
in response to an iteration cutoff condition being met, outputting annotation data which corresponds to the data to be classified and comprises at least one hierarchy label and a plurality of categories, wherein the hierarchy label is used for representing a classification hierarchy of the annotation data.
2. The method according to claim 1, wherein the classifying process of the data to be classified is performed with multiple iterations by using an unsupervised learning method, which includes:
for each iteration the following operations are performed:
inputting the data to be classified into a preset embedding model, and outputting a first semantic representation vector through the embedding model;
clustering the first semantic representation vector through a first clustering algorithm to obtain pseudo labeling data of multiple categories;
adjusting parameters of the embedded model through the pseudo labeling data of the multiple categories;
inputting the data to be classified into the adjusted embedding model, and outputting a second semantic representation vector through the adjusted embedding model;
clustering the second semantic representation vector through a second clustering algorithm to obtain labeling data of multiple categories belonging to one type of the hierarchy label, wherein the hierarchy labels to which the labeling data obtained in each iteration belong are different;
and taking the labeling data of the multiple categories of the current round as the data to be classified of the next iteration.
3. The method of claim 2, wherein the iteration cutoff condition comprises:
and after the first semantic representation vector is clustered through the first clustering algorithm, the pseudo-annotation data of the multiple categories do not exist.
4. The method of claim 2, wherein adjusting the parameters of the embedded model by the classes of pseudo-annotation data comprises:
extracting pseudo-labeling data corresponding to the high-density point which accords with the preset condition of each of the multiple categories;
and adjusting the parameters of the embedded model through the pseudo-labeled data corresponding to all the high-density points meeting the preset conditions.
5. The method of claim 2, wherein adjusting the parameters of the embedded model by the plurality of classes of pseudo-annotation data further comprises:
and superposing a classification layer behind the embedded model, and training the embedded model superposed with the classification layer through the pseudo-labeled data of the multiple categories.
6. The method of claim 5, wherein inputting the data to be classified into an adjusted embedding model, outputting a second semantic representation vector via the adjusted embedding model, comprises:
and removing the classification layer of the embedding model of the superimposed classification layer, and inputting the data to be classified into the embedding model of the superimposed classification layer from which the classification layer is removed to obtain the second semantic representation vector.
7. The method of claim 2, wherein the second clustering algorithm comprises a first sub-clustering algorithm and a second sub-clustering algorithm, and
clustering the second semantic representation vector through the second clustering algorithm to obtain labeled data of a plurality of categories belonging to one hierarchy label comprises:
clustering the second semantic representation vector through the first sub-clustering algorithm to remove noise data from the second semantic representation vector;
and clustering the second semantic representation vector with the noise data removed through the second sub-clustering algorithm to obtain the labeled data of the plurality of categories belonging to one hierarchy label.
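The two sub-clustering algorithms are not named. A common pairing, assumed below, is a density-based first stage (DBSCAN, whose label -1 marks noise) followed by a partitioning second stage (KMeans) on the surviving points:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

def two_stage_cluster(vectors: np.ndarray, n_categories: int,
                      eps: float = 0.8, min_samples: int = 5):
    """Assumed sketch: DBSCAN is used only to flag and drop noise, then
    KMeans assigns the clean points to the categories of one level."""
    noise = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(vectors) == -1
    clean = vectors[~noise]
    if len(clean) < n_categories:  # too few survivors: treat as one category
        return clean, np.zeros(len(clean), dtype=int), noise
    labels = KMeans(n_clusters=n_categories, n_init=10).fit_predict(clean)
    return clean, labels, noise
```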
8. The method according to claim 2, wherein taking the labeled data of the plurality of categories from the current iteration as the data to be classified for the next iteration comprises:
taking the labeled data of each of the plurality of categories separately as data to be classified for the next iteration.
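In other words, the recursion fans out: each category found at the current level is classified independently at the next level. A minimal sketch of that split, with hypothetical names:

```python
import numpy as np

def split_by_category(data: np.ndarray, labels: np.ndarray):
    """Route each labeled category into its own 'data to be classified'
    set for the next, deeper iteration of the claim-2 loop."""
    return {int(k): data[labels == k] for k in np.unique(labels)}
```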
9. The method of claim 2, further comprising:
before clustering the first semantic representation vector through the first clustering algorithm, performing dimensionality reduction on the first semantic representation vector through a dimensionality-reduction algorithm to obtain a low-dimensional first semantic representation vector;
and before clustering the second semantic representation vector through the second clustering algorithm, performing dimensionality reduction on the second semantic representation vector through the dimensionality-reduction algorithm to obtain a low-dimensional second semantic representation vector.
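The dimensionality-reduction algorithm is likewise unspecified; PCA is assumed below purely as a stand-in (UMAP or t-SNE would slot in the same way), motivated by the fact that distance-based clustering tends to degrade in high-dimensional spaces.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_dim(vectors: np.ndarray, n_components: int = 8) -> np.ndarray:
    """Project semantic representation vectors to a low-dimensional space
    before clustering; PCA is an assumed choice, not the patent's."""
    return PCA(n_components=n_components).fit_transform(vectors)
```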
10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 9 when executing the program.
CN202211118551.9A 2022-09-15 2022-09-15 Event automatic hierarchical classification method and electronic equipment Active CN115204318B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118551.9A CN115204318B (en) 2022-09-15 2022-09-15 Event automatic hierarchical classification method and electronic equipment

Publications (2)

Publication Number Publication Date
CN115204318A (en) 2022-10-18
CN115204318B (en) 2022-12-02

Family ID: 83572060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118551.9A Active CN115204318B (en) 2022-09-15 2022-09-15 Event automatic hierarchical classification method and electronic equipment

Country Status (1)

Country Link
CN (1) CN115204318B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102890698A (en) * 2012-06-20 2013-01-23 杜小勇 Method for automatically describing microblogging topic tag
WO2019214133A1 (en) * 2018-05-08 2019-11-14 华南理工大学 Method for automatically categorizing large-scale customer complaint data
CN111522941A (en) * 2019-02-03 2020-08-11 阿里巴巴集团控股有限公司 Text clustering method and device, electronic equipment and computer storage medium
US20200349174A1 (en) * 2019-04-30 2020-11-05 Amperity, Inc. Clustering of data records with hierarchical cluster ids
US10970493B1 (en) * 2019-10-18 2021-04-06 Clinc, Inc. Systems and methods for slot relation extraction for machine learning task-oriented dialogue systems
CN113064990A (en) * 2021-01-04 2021-07-02 上海金融期货信息技术有限公司 Hot event identification method and system based on multi-level clustering
CN112989841A (en) * 2021-02-24 2021-06-18 中国搜索信息科技股份有限公司 Semi-supervised learning method for emergency news identification and classification
CN113312447A (en) * 2021-03-10 2021-08-27 天津大学 Semi-supervised log anomaly detection method based on probability label estimation
CN113065341A (en) * 2021-03-14 2021-07-02 北京工业大学 Automatic labeling and classifying method for environmental complaint report text
CN113139051A (en) * 2021-03-29 2021-07-20 广东外语外贸大学 Text classification model training method, text classification method, device and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIJUAN CAI ET AL.: "Hierarchical Document Categorization with Support Vector Machines", Proceedings of the 2004 ACM CIKM International Conference on Information and Knowledge Management *
XIE BINHONG ET AL.: "Open Relation Extraction Method Based on Unsupervised Ensemble Clustering", Journal of Chinese Information Processing *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910377A (en) * 2023-09-14 2023-10-20 长威信息科技发展股份有限公司 Grid event classified search recommendation method and system
CN116910377B (en) * 2023-09-14 2023-12-08 长威信息科技发展股份有限公司 Grid event classified search recommendation method and system

Also Published As

Publication number Publication date
CN115204318B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
US20210027083A1 (en) Automatically detecting user-requested objects in images
CN110785736A (en) Automatic code generation
CN109697451B (en) Similar image clustering method and device, storage medium and electronic equipment
US20220342921A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
CN114580263A (en) Knowledge graph-based information system fault prediction method and related equipment
CN116049397B (en) Sensitive information discovery and automatic classification method based on multi-mode fusion
Chen et al. A saliency map fusion method based on weighted DS evidence theory
CN115204318B (en) Event automatic hierarchical classification method and electronic equipment
CN113763385A (en) Video object segmentation method, device, equipment and medium
CN116189172A (en) 3D target detection method, device, storage medium and chip
Wen et al. Semantic segmentation using a GAN and a weakly supervised method based on deep transfer learning
CN116630480B (en) Interactive text-driven image editing method and device and electronic equipment
CN117113174A (en) Model training method and device, storage medium and electronic equipment
CN113139540B (en) Backboard detection method and equipment
CN116152933A (en) Training method, device, equipment and storage medium of anomaly detection model
CN112613072B (en) Information management method, management system and management cloud platform based on archive big data
Xu et al. Situational perception guided image matting
CN114051625A (en) Point cloud data processing method, device, equipment and storage medium
CN115587297A (en) Method, apparatus, device and medium for constructing image recognition model and image recognition
Saathoff Constraint reasoning for region-based image labelling
KR102588531B1 (en) System and method for processing training data
Ma et al. Multimodal Latent Factor Model with Language Constraint for Predicate Detection
CN115442660B (en) Self-supervision countermeasure video abstract extraction method, device, equipment and storage medium
CN116304135B (en) Cross-modal retrieval method, device and medium based on discriminant hidden space learning
Maihami et al. Color features and color spaces applications to the automatic image annotation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant