CN111783995A - Classification rule obtaining method and device - Google Patents


Info

Publication number
CN111783995A
Authority
CN
China
Prior art keywords
sample data
index
class
target
category
Prior art date
Legal status
Granted
Application number
CN202010537532.4A
Other languages
Chinese (zh)
Other versions
CN111783995B (en)
Inventor
王聪
沈承恩
杨善松
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202010537532.4A
Publication of CN111783995A
Application granted
Publication of CN111783995B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to the classification rule obtaining method and device, data classified by the SWEM model can be used as sample data, and the target categories with the smallest first measurement index and the smallest second measurement index among all categories are determined respectively. A minimal first index indicates that the separability of the data within the target category is poor, and a minimal second index indicates that the separability between the two target categories corresponding to the second index is poor. Then, the coincident target sample data in the two target categories are determined, and the category of the target sample data is modified so that it is clearly distinguished from the other categories, forming a new classification rule that subsumes the preset classification rule. With this technical scheme, the target sample data whose category is to be modified can be determined from the measurement indexes, a more specific and accurate classification rule is formed, the method can be applied to data sets iterated over multiple versions, and the application range is wide.

Description

Classification rule obtaining method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for obtaining classification rules.
Background
With the rapid development of artificial intelligence, machine learning and deep learning are widely used in classification tasks, especially in natural language processing tasks, such as: user intent identification, spam identification, and the like. With the development of deep learning, there are various classification models based on deep learning, such as: textCNN model, Transformer model, BERT model, etc.
Processing various classification tasks based on a classification model is currently the main method of data classification. The current data classification work flow mainly comprises: first, several classification standards are established manually according to the service type or prior knowledge; the data set is then labeled by category according to each classification standard in turn; next, the data set is classified by a deep-learning-based classification model; the manually labeled categories are verified one by one against the machine classification results; and the classification standard of any data set whose classification result is unsatisfactory is modified.
However, in this data classification method, technicians design the various classification standards from personal experience, on the premise of meeting the business requirements; without knowing which classification standard is more reasonable, they can only feed the data under every classification standard into the classification model and judge by the final machine classification results. It can be seen that, in such a data classification manner, a classification standard designed subjectively by a technician is not applicable to different versions of a data set.
Disclosure of Invention
The application provides a classification rule obtaining method and apparatus, aiming to solve the problem that classification standards in current data classification methods have a narrow application range.
In a first aspect, the present application provides a classification rule obtaining method, including:
representing the sample data set as sample data of different categories by using a SWEM (Simple Word-Embedding-based Model), wherein the SWEM model has a preset classification rule;
determining, among all the categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories;
determining the target sample data that coincide between the first target category and the second target category;
and modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule.
In some embodiments of the present application, the step of determining, among all the categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category includes:
respectively calculating the second measurement index between every two categories;
calculating the first measurement index of each category by using the second measurement indexes related to each category;
determining, among all categories, the first target category with the smallest first measurement index;
determining, among all categories, the second target category corresponding to the smallest second measurement index related to the first target category.
In some embodiments of the present application, the second measurement index between two categories is calculated according to the following formula:

S_ij = B_ij / W_i

where S_ij represents the second measurement index between class i and class j, B_ij denotes the inter-class distance between class i and class j, and W_i denotes the intra-class distance of class i.
In some embodiments of the present application, the inter-class distance B_ij between class i and class j is calculated according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

where c_i represents the mean vector of class i, and c_j represents the mean vector of class j.
In some embodiments of the present application, the intra-class distance W_i is calculated according to the following formula:

W_i = Σ_k (x_k - c_i)(x_k - c_i)^T

where x_k represents the k-th sample datum in class i, and c_i represents the mean vector of class i.
In some embodiments of the present application, the first measurement index of each category is calculated according to the following formula:

S_i = ( Σ_{j=1, j≠i}^{N} n_j · S_ij ) / ( Σ_{j=1, j≠i}^{N} n_j )

where S_i represents the first measurement index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.
In some embodiments of the present application, after calculating the first measurement index of each category by using the second measurement indexes related to each category, the method further includes: calculating the data set measurement index of the entire sample data set.
In some embodiments of the present application, the data set measurement index is calculated according to the following formula:

S = (1/N) · Σ_{i=1}^{N} S_i

where S represents the data set measurement index, S_i represents the first measurement index of category i, and N represents the number of categories.
In some embodiments of the present application, after modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule, the method further includes:
enabling the SWEM model to represent the sample data set as sample data of different categories again by using the new classification rule;
determining again, among all categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories;
determining again the target sample data that coincide between the first target category and the second target category;
and modifying the category to which the target sample data belongs in combination with the new classification rule, until the data set measurement index meets the preset requirement, thereby forming a final classification rule.
In some embodiments of the present application, the step of representing the sample data set into different classes of sample data using the SWEM model comprises:
dividing the sample data set into a plurality of short texts;
performing word segmentation processing on the short text to obtain a plurality of words;
representing each word as a word vector;
and inputting the sample data set in word-vector form into the SWEM model to obtain sample data of different categories, wherein the sample data are dense vectors output by the SWEM model.
In a second aspect, the present application further provides a classification rule obtaining apparatus, including:
a sample data acquisition module, configured to represent the sample data set as sample data of different categories by using a SWEM (Simple Word-Embedding-based Model), wherein the SWEM model has a preset classification rule;
a classification measurement module, configured to determine, among all categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories; and to determine the target sample data that coincide between the first target category and the second target category;
and a category modification module, configured to modify the category to which the target sample data belongs in combination with the preset classification rule to form a new classification rule.
As can be seen from the above, with the classification rule obtaining method and device of the technical scheme of the application, data classified by the SWEM model can be used as sample data, and the target categories with the smallest first measurement index and the smallest second measurement index among all categories are determined respectively. A minimal first index indicates that the separability of the data within the target category is poor, and a minimal second index indicates that the separability between the two target categories corresponding to the second index is poor. Then, the coincident target sample data in the two target categories are determined, and the category of the target sample data is modified so that it is clearly distinguished from the other categories, forming a new classification rule that subsumes the preset classification rule. With this technical scheme, the target sample data whose category is to be modified can be determined from the measurement indexes, a more specific and accurate classification rule is formed, the method can be applied to data sets iterated over multiple versions, and the application range is wide.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments will be briefly described below; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a discrete vectorization representation method according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a dense vectorization representation method according to an embodiment of the present application;
fig. 3 is a flowchart illustrating a classification rule obtaining method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the processing results of a SWEM model according to an embodiment of the present application;
fig. 5 is a flowchart illustrating another classification rule obtaining method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a classification rule obtaining apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, embodiments, and advantages of the present application clearer, exemplary embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It should be understood that the described exemplary embodiments are only a part of the embodiments of the present application, not all of them.
All other embodiments obtained by a person of ordinary skill in the art based on the exemplary embodiments described herein without creative effort shall fall within the scope of the appended claims. In addition, while the disclosure herein has been presented in terms of one or more exemplary examples, it should be appreciated that each aspect of the disclosure may also separately constitute a complete embodiment.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above drawings are used for distinguishing between similar or analogous objects or entities, and are not necessarily intended to limit a particular order or sequence, unless otherwise indicated. It should be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments described herein can, for example, be practiced in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.
The term "module," as used herein, refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
Currently, with the rapid development of artificial intelligence technology, machine learning and deep learning are widely applied in data classification tasks, especially in natural language processing tasks, such as user intent identification and spam identification. As deep learning has evolved, a variety of deep-learning-based classification models have emerged, such as the TextCNN model, the Transformer model, the BERT model, the SWEM model, and the like.
The current data classification work flow mainly comprises: first, several classification standards are established manually according to the service type or prior knowledge; the data set is then labeled by category according to each classification standard in turn; next, the data set is classified by a deep-learning-based classification model; the manually labeled categories are verified one by one against the machine classification results; and the classification standard of any data set whose classification result is unsatisfactory is modified.
However, in this data classification method, technicians design the various classification standards from personal experience, on the premise of meeting the business requirements; without knowing which classification standard is more reasonable, they can only feed the data under every classification standard into the classification model and judge by the final machine classification results. It can be seen that, in such a data classification manner, a classification standard designed subjectively by a technician is not applicable to different versions of a data set. In addition, technicians design multiple classification standards for the same data set and need to verify the classification effect of each standard one by one, which is labor-intensive and time-consuming.
Therefore, in view of the above, embodiments of the present application provide a classification rule obtaining method and apparatus, which can determine a unified classification rule through measurement indexes and can be applied to multi-version data sets, so that designers no longer need to manually design multiple classification standards, thereby saving time.
In an embodiment of the present application, the sample data set may be a text data set. In practical cases, there may be two vectorized representations of a text data set, one discrete representation and one dense representation.
Fig. 1 is a schematic diagram of the discrete vectorized representation method according to an embodiment of the present application. In the discrete representation method, feature extraction is mainly performed, based on artificially designed features, on the short texts converted from the text data set. Text features are typically based on part of speech, word labels, characters, word frequency, and so on, as well as various hybrid text features, such as: tag + tag, tag + word, and word + word. Different feature extractors extract features from the short text separately: the corresponding position of the feature vector is set to 1 when the feature is satisfied and 0 when it is not, and finally the whole text data set can be represented as a 1×N-dimensional 0-1 feature vector. However, in this representation method, the design of the features depends mainly on the personal work experience of the technician and his sensitivity to the data, and is highly subjective; when the text data set changes, the features need to be redesigned, and the workload on the technician is too large, so this representation method is not often used in practice.
Fig. 2 is a schematic diagram of the dense vectorized representation method according to an embodiment of the present application. In the dense representation method, each word in a short text is represented as a word vector by using a word2vec model; the word vectors are then used as input, the representation of each sentence is calculated by a SWEM model, and a dense vector is output. This representation method requires no manual feature design, greatly reduces the workload of technicians, and does not depend on a specific classification standard.
Therefore, the classification rule obtaining method provided in the embodiments of the present application relies on the SWEM model to obtain the classified sample data of the data set.
Fig. 3 is a flowchart of a classification rule obtaining method according to an embodiment of the present application, and as shown in fig. 3, the method specifically includes:
step S101, the sample data set is expressed into sample data of different types by using a SWEM model, and the SWEM model has preset classification rules.
There are 4 methods of representing word vectors in the SWEM model: SWEM-aver (average pooling), SWEM-max (max pooling), SWEM-concat (concatenation), and SWEM-hier (hierarchical pooling).
SWEM-aver denotes average pooling, i.e., averaging the word vectors element-wise; this approach takes the information of every word into account. SWEM-max denotes max pooling, taking the maximum value in each dimension of the word vectors; this approach keeps the most significant feature information, while other irrelevant or unimportant information is ignored. SWEM-concat denotes concatenation: considering that the information captured by the two pooling methods above is complementary, this variant concatenates the results of the two pooling methods. SWEM-hier denotes hierarchical pooling: since the above methods do not consider word order or spatial information, this variant first applies average pooling over a local window of size n, and then applies global max pooling to the window results.
In practical use, in order to simplify the calculation and extract as much information as possible, SWEM-concat is often used as a sentence representation.
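The four pooling variants above operate only on the word-vector matrix of a sentence, so they can be sketched in a few lines of NumPy. The function names and the window handling in SWEM-hier are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def swem_aver(word_vecs):
    # Average pooling: element-wise mean over all word vectors.
    return word_vecs.mean(axis=0)

def swem_max(word_vecs):
    # Max pooling: element-wise maximum over all word vectors.
    return word_vecs.max(axis=0)

def swem_concat(word_vecs):
    # Concatenation: join the complementary average- and max-pooled results.
    return np.concatenate([swem_aver(word_vecs), swem_max(word_vecs)])

def swem_hier(word_vecs, n=2):
    # Hierarchical pooling: average pooling over local windows of size n,
    # then global max pooling over the window averages.
    num_words = len(word_vecs)
    windows = [word_vecs[k:k + n].mean(axis=0)
               for k in range(max(1, num_words - n + 1))]
    return np.stack(windows).max(axis=0)
```

For a 3-word sentence with 2-dimensional word vectors, swem_concat yields the 4-dimensional concatenation of the average- and max-pooled vectors, which is why it is often preferred as the sentence representation.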
In addition, before the SWEM model is used, some training samples are used for learning and training so that the SWEM model can classify the sample data set according to a specific classification rule. The specific classification rule mentioned here is the preset classification rule mentioned in the embodiments of the present application.
Step S102, a first target category with the smallest first measurement index, and a second target category with the smallest second measurement index with respect to the first target category, are determined among all categories, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories.
Fig. 4 is a schematic diagram of a processing result of the SWEM model according to an embodiment of the present application. As shown in Fig. 4, a sample data set may be divided into 3 categories after being processed by the SWEM model: category 1 contains 3 sample data, category 2 contains 4 sample data, and category 3 contains 2 sample data. In this embodiment, the categories produced by the SWEM model, and the sample data under them, are what is processed.
In this embodiment, the classification of the sample data set is based on a clustering algorithm, which may partition a data set into different classes or clusters according to a specific criterion (such as distance), so that the similarity of data objects within the same cluster is as large as possible while the difference between data objects in different clusters is as large as possible. Clustering performance is measured by internal measurement indexes and external measurement indexes. Internal measurement indexes include, for example, the Silhouette Coefficient, the CH index (Calinski-Harabasz index), and the Davies-Bouldin index; external measurement indexes include, for example, the Jaccard coefficient, the FM index (Fowlkes-Mallows index), the Rand index, the DB index (Davies-Bouldin index), and the Dunn index.
In this embodiment, the CH index is used as the measurement index to screen the target categories. The CH index also measures the separability of a category; compared with other measurement indexes, its principle is simpler and easier to understand, it is highly practical, and it is fast to compute. The larger the CH index, the better the separability, and the less likely the category is to coincide with other categories or other data.
Step S103, determining target sample data which are overlapped in the first target category and the second target category.
In this embodiment, the first target category and the second target category may both be understood as two target categories whose internal sample data have a relatively high coincidence ratio, and they can be screened out through calculation of the measurement indexes. Further, sample data with a high degree of coincidence between the first target category and the second target category may be used as the target sample data. Here, the degree of coincidence may be compared against a specific threshold: for example, the similarity between sample data in the first target category and sample data in the second target category is calculated, and sample data whose similarity is greater than or equal to the threshold are determined as the coincident target sample data.
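The threshold comparison just described can be sketched as follows. The use of cosine similarity, the 0.95 threshold, and the function name are illustrative assumptions, since the embodiment leaves the concrete similarity measure open:

```python
import numpy as np

def coincident_samples(class_a, class_b, threshold=0.95):
    # Return (k, m) index pairs whose cosine similarity reaches the
    # threshold; such pairs are treated as coincident target sample data.
    pairs = []
    for k, x in enumerate(class_a):
        for m, y in enumerate(class_b):
            sim = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
            if sim >= threshold:
                pairs.append((k, m))
    return pairs
```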
And step S104, modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule.
A high coincidence ratio of the target sample data indicates that it is easily classified into two or more different categories. For example, suppose the sample data set is classified into two categories, "beauty treatment" and "life service", according to the preset classification rule, but sample data 1 in the "life service" category and sample data 2 in the "beauty treatment" category are highly similar, and according to its specific content sample data 1 could also be classified into the "beauty treatment" category. Such a classification is obviously not standard enough, and when classification is performed with the preset classification rule, sample data 1 is difficult to assign accurately to a fixed category. Therefore, in this embodiment, after the first target category and the second target category are determined, the category of the highly coincident target sample data needs to be changed: either a category is established separately for the target sample data, or the target sample data is assigned to a certain fixed category. In this way the previous preset classification rule is refined, and a new classification rule is formed. Thereafter, the SWEM model may classify the data set according to the new classification rule.
In some embodiments, the step of determining, among all categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category comprises:
step S201, respectively calculating a second weighing index between every two categories; also, in some embodiments, a second metric between two categories may be calculated according to the following formula:
Figure BDA0002537530700000071
wherein S isijRepresenting a second index of measure, B, between class i and class jijPresentation classInter-class distance, W, between class i and class jiIndicating the intra-class distance of class i.
For example, a second metric between class 1 and class 2 is calculated, then
Figure BDA0002537530700000072
Calculate a metric between class 2 and class 1, then
Figure BDA0002537530700000073
In some embodiments, the inter-class distance B_ij between class i and class j may be calculated according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

where c_i represents the mean vector of class i, and c_j represents the mean vector of class j.

For example, the inter-class distance between class 1 and class 2 is B_12 = (c_1 - c_2)(c_1 - c_2)^T, and the inter-class distance between class 2 and class 1 is B_21 = (c_2 - c_1)(c_2 - c_1)^T.
The larger the inter-class distance, the better the separability between the two classes.
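Treating each mean vector c_i as a row vector, the product (c_i - c_j)(c_i - c_j)^T is a scalar, the squared Euclidean distance between the two class means; a minimal sketch (function name assumed):

```python
import numpy as np

def inter_class_distance(samples_i, samples_j):
    # B_ij = (c_i - c_j)(c_i - c_j)^T: the squared Euclidean distance
    # between the mean vectors of class i and class j.
    c_i = np.mean(samples_i, axis=0)
    c_j = np.mean(samples_j, axis=0)
    diff = c_i - c_j
    return float(diff @ diff)
```

Note that B_ij = B_ji by construction, consistent with the example above.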
In some embodiments, the intra-class distance W_i may be calculated according to the following formula:

W_i = Σ_k (x_k - c_i)(x_k - c_i)^T

where x_k represents the k-th sample datum in class i, and c_i represents the mean vector of class i.

For example, the intra-class distance of class 1 is W_1 = Σ_k (x_k - c_1)(x_k - c_1)^T, and the intra-class distance of class 2 is W_2 = Σ_k (x_k - c_2)(x_k - c_2)^T.
The smaller the intra-class distance, the better the aggregation within the class.
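The intra-class distance and the second measurement index can then be sketched together, assuming the second index takes the ratio form S_ij = B_ij / W_i (a reconstruction; the asymmetry between S_12 and S_21 then follows from the denominator):

```python
import numpy as np

def intra_class_distance(samples_i):
    # W_i = sum over k of (x_k - c_i)(x_k - c_i)^T: the total squared
    # distance of the samples in class i from the class mean vector c_i.
    c_i = np.mean(samples_i, axis=0)
    return float(sum((x - c_i) @ (x - c_i) for x in np.asarray(samples_i)))

def second_index(samples_i, samples_j):
    # S_ij = B_ij / W_i (reconstructed ratio form): a large value means the
    # two classes are far apart relative to the spread of class i.
    c_i = np.mean(samples_i, axis=0)
    c_j = np.mean(samples_j, axis=0)
    b_ij = float((c_i - c_j) @ (c_i - c_j))
    return b_ij / intra_class_distance(samples_i)
```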
Step S202, calculating the first measurement index of each category by using the second measurement indexes related to each category.
For example, a sample data set is divided into 3 categories after passing through the SWEM model, and the second measurement indexes between every two categories are S_11, S_12, S_13, S_21, S_22, S_23, S_31, S_32, and S_33. The second measurement indexes related to category 1 are S_11, S_12, and S_13; the second measurement indexes related to category 2 are S_21, S_22, and S_23; and the second measurement indexes related to category 3 are S_31, S_32, and S_33.
Also, in some embodiments, the first measurement index of each category may be calculated according to the following formula:

S_i = ( Σ_{j=1, j≠i}^{N} n_j · S_ij ) / ( Σ_{j=1, j≠i}^{N} n_j )

where S_i represents the first measurement index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.

For example, if there are 3 categories, N is 3; if the number of sample data in each category is 3, the first measurement indexes of category 1, category 2, and category 3 reduce to S_1 = (S_12 + S_13)/2, S_2 = (S_21 + S_23)/2, and S_3 = (S_31 + S_32)/2.
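A sketch of this per-class aggregation, assuming the first index is the n_j-weighted average of the second indexes over the other categories (the exact weighting is reconstructed from the surrounding definitions, not taken verbatim from the patent):

```python
def first_index(i, S, counts):
    # First measurement index of class i: the second indexes S[i][j] over
    # the other classes j, averaged with weights n_j = counts[j]
    # (the sample count of class j). The weighting form is an assumption.
    N = len(counts)
    num = sum(counts[j] * S[i][j] for j in range(N) if j != i)
    den = sum(counts[j] for j in range(N) if j != i)
    return num / den
```

With equal class sizes the weights cancel, matching the simplified 3-category example above.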
step S203, determining a first target category with the minimum first scale index in all categories.
Step S204, determining a second target category corresponding to the minimum second weighing index related to the first target category in all categories.
For example, if the first index of all the categories is the smallest category 1, then the smallest of the second indexes associated with category 1 needs to be further determined, i.e., S is determined11、S12And S13Which is the smallest, if S12Then category 2 may be determined to be the second target category if S13Then category 3 may be determined to be the second target category.
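Steps S203 and S204 amount to two argmin searches, which can be sketched as follows (function name assumed):

```python
def select_target_categories(first_indexes, S):
    # Step S203: the first target category minimizes the first
    # measurement index over all categories.
    i_star = min(range(len(first_indexes)), key=lambda i: first_indexes[i])
    # Step S204: the second target category minimizes S[i_star][j]
    # over the remaining categories j != i_star.
    j_star = min((j for j in range(len(S)) if j != i_star),
                 key=lambda j: S[i_star][j])
    return i_star, j_star
```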
In some embodiments, after calculating the first metric for each category using the second metric associated with each category, the method further comprises: and calculating the data set weighing index of the whole sample data set. Also, in some embodiments, the dataset weighting index may be calculated according to the following formula:
S = (1/N) Σ_{i=1}^{N} S̄_i

wherein S represents the data set measure index, S̄_i represents the first measure index of category i, and N represents the number of categories.
As described above, if there are 3 categories, then S = (S̄_1 + S̄_2 + S̄_3)/3.
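The data set measure index of this embodiment, taken as the average of the per-category first measure indexes per the formula above, together with the threshold check, can be sketched as follows (the threshold value is illustrative):

```python
def dataset_measure_index(first_indexes):
    """S = (1/N) * sum_i S_bar_i: average of the first measure indexes."""
    return sum(first_indexes) / len(first_indexes)

def meets_preset_requirement(first_indexes, threshold):
    """True when the data set measure index reaches the preset threshold."""
    return dataset_measure_index(first_indexes) >= threshold
```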
The data set measure index in this embodiment may be used to measure the separability of the entire sample data set. The larger S is, the better the separability of the data set, and the less likely it is that sample data from different categories overlap after classification. When the value of S reaches a certain threshold, the currently used classification rule may be considered to have achieved an ideal effect. If the value of S calculated after classifying with the current classification rule does not reach the preset threshold, the category to which the target sample data belongs needs to be modified to form a new classification rule, which is then applied again. Therefore, in some embodiments, after modifying the category to which the target sample data belongs in combination with the preset classification rule to form a new classification rule, the method may further include the following steps:
Step S301, the SWEM model re-represents the sample data set as different categories of sample data by using the new classification rule;
Step S302, a first target category with the minimum first measure index, and a second target category having the minimum second measure index with the first target category, are determined again among all categories, wherein the first measure index is used to measure the separability of sample data within a category, and the second measure index is used to measure the separability of sample data between categories;
Step S303, target sample data overlapping between the first target category and the second target category are determined again;
Step S304, the category to which the target sample data belongs is modified in combination with the new classification rule until the data set measure index meets the preset requirement, thereby forming a final classification rule.
The preset requirement in this embodiment may refer to a preset threshold or the like. If the data set measure index is greater than or equal to the preset threshold, the new classification rule obtained in the current acquisition process has met the requirement, or the classification result is close to an ideal state; at this time, the next acquisition process may be terminated, and the classification rule of the current process is taken as the final classification rule.
Combined with steps S101 to S104, the process of steps S301 to S304 may be understood as looping steps S101 to S104 whenever the data set measure index, calculated after step S102, does not meet the preset requirement. The specific loop flow is shown in fig. 5, which is a flowchart of another classification rule obtaining method according to an embodiment of the present application. The method further includes step S1021, calculating the data set measure index; and step S1022, judging whether the data set measure index meets the preset requirement: if it does, the acquisition process of the classification rule ends; if not, step S101 is repeated, that is, the sample data set is classified by using the new classification rule.
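The loop of fig. 5 can be sketched as follows; `classify` and `relabel` are hypothetical hooks standing in for step S101 (SWEM classification plus index computation) and steps S103-S104 (modifying the category of the overlapping target samples), not APIs from the patent:

```python
def refine_classification_rule(classify, relabel, threshold, max_rounds=20):
    """Repeat steps S101-S104 until the data set measure index S meets the
    preset requirement (step S1022) or a round limit is hit.

    classify(rule) -> (categories, S_matrix, first_indexes)
    relabel(categories, S_matrix, first_indexes) -> new classification rule
    """
    rule, s = None, float("-inf")
    for _ in range(max_rounds):
        categories, S_matrix, first_indexes = classify(rule)  # step S101
        s = sum(first_indexes) / len(first_indexes)           # step S1021
        if s >= threshold:                                    # step S1022
            break
        rule = relabel(categories, S_matrix, first_indexes)   # steps S103-S104
    return rule, s
```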
In some embodiments, the step of representing the sample data set into different classes of sample data using the SWEM model comprises:
step S401, the sample data set is divided into a plurality of short texts.
As can be seen from the above, the sample data set may be a text data set. Before the classification rule obtaining method of the embodiment of the present application is performed, the text data set needs to be divided into a plurality of short texts; the specific division manner includes, but is not limited to, dividing the text at commas, periods, and other symbols.
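A minimal sketch of step S401, splitting on comma, period and their full-width counterparts. The exact delimiter set is an assumption; the text above only says "comma, period, and other symbols":

```python
import re

def split_into_short_texts(text):
    """Divide a text data set into short texts at punctuation boundaries."""
    parts = re.split(r"[,.，。;；!！?？]", text)
    return [p.strip() for p in parts if p.strip()]  # drop empty fragments
```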
Step S402, performing word segmentation processing on the short text to obtain a plurality of words.
For example, if a short text is "I want to watch a movie", word segmentation may be performed on it to obtain three words: "I", "want to watch", and "movie".
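Word segmentation itself is typically delegated to a trained Chinese segmenter such as jieba; purely for illustration, a greedy forward-maximum-matching sketch over a toy (hypothetical) vocabulary reproduces the segmentation of the example:

```python
def segment(text, vocab):
    """Greedy forward maximum matching: at each position take the longest
    vocabulary entry, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```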
In step S403, each word is represented as a word vector.
Still taking "I want to watch a movie" as an example, the three words "I", "want to watch", and "movie" may each be represented as a 300-dimensional word vector, and the word vectors corresponding to the three words are then added and spliced to form the word vector corresponding to the short text.
And S404, inputting the sample data set into the SWEM model in a word vector mode to obtain sample data of different types, wherein the sample data are dense vectors output by the SWEM model.
In this embodiment, the word vectors input into the SWEM model are the word vectors corresponding to the short texts; after being processed by the SWEM model, they are directly classified according to the existing classification rules of the SWEM model, so as to obtain the different categories of sample data.
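SWEM (simple word-embedding-based model) builds the short-text representation by parameter-free pooling of its word vectors; a minimal average-pooling sketch, where the 4-dimensional toy embeddings in the usage below stand in for the 300-dimensional vectors of the example:

```python
import numpy as np

def swem_average(words, embeddings, dim=300):
    """Average-pool the word vectors of a short text into one dense vector.
    Out-of-vocabulary words fall back to a zero vector."""
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.mean(vectors, axis=0)
```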
According to the above scheme, the classification rule obtaining method provided by the embodiment of the present application may take data classified by the SWEM model as sample data and respectively determine the target categories with the minimum first measure index and the minimum second measure index among all categories. A minimum first measure index indicates poor separability within the target category; a minimum second measure index indicates poor separability between the two target categories to which it corresponds. Then, the overlapping target sample data in the two target categories are determined, and their category is modified so that the target sample data are clearly distinguished from other categories, forming a new classification rule that contains the preset classification rule. With this technical solution, the target sample data whose category is to be modified can be determined according to the measure indexes, a more specific and accurate classification rule is formed, and the method can be applied to multi-version iterative data sets, giving it a wide application range.
Fig. 6 is a schematic structural diagram of a classification rule obtaining apparatus according to an embodiment of the present application. As shown in fig. 6, an apparatus for obtaining a classification rule provided in an embodiment of the present application includes:
The sample data acquisition module 61 is configured to represent a sample data set as different categories of sample data by using a SWEM model, where the SWEM model has a preset classification rule. The class measurement module 62 is configured to determine, among all categories, a first target category with a minimum first measure index and a second target category having a minimum second measure index with the first target category, where the first measure index is used to measure the separability of sample data within a category and the second measure index is used to measure the separability of sample data between categories; and to determine target sample data overlapping between the first target category and the second target category. The category modification module 63 is configured to modify, in combination with the preset classification rule, the category to which the target sample data belongs, so as to form a new classification rule.
In some embodiments, the classification metric module is further to: calculate a second measure index between every two categories; calculate the first measure index of each category by using the second measure indexes related to that category; determine the first target category with the minimum first measure index among all categories; and determine the second target category corresponding to the minimum second measure index related to the first target category among all categories.
In some embodiments, the classification metric module is further to: a second scale index between two categories is calculated according to the following formula:
S_ij = B_ij / W_i

wherein S_ij represents the second measure index between category i and category j, B_ij represents the inter-class distance between category i and category j, and W_i represents the intra-class distance of category i.
In some embodiments, the classification metric module is further to: calculate the inter-class distance B_ij between category i and category j according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

wherein c_i represents the mean vector of category i and c_j represents the mean vector of category j.
In some embodiments, the classification metric module is further to: calculate the intra-class distance W_i according to the following formula:

W_i = (1/n_i) Σ_{k=1}^{n_i} (x_k - c_i)(x_k - c_i)^T

wherein x_k represents the kth sample data in category i, c_i represents the mean vector of category i, and n_i represents the number of sample data in category i.
In some embodiments, the classification metric module is further to: calculate the first measure index of each category according to the following formula:

S̄_i = (1/N) Σ_{j=1}^{N} n_j · S_ij

wherein S̄_i represents the first measure index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.
In some embodiments, the classification metric module is further to: after the first measure index of each category is calculated using the second measure indexes associated with each category, calculate the data set measure index of the entire sample data set.
In some embodiments, the classification metric module is further to: calculate the data set measure index according to the following formula:

S = (1/N) Σ_{i=1}^{N} S̄_i

wherein S represents the data set measure index, S̄_i represents the first measure index of category i, and N represents the number of categories.
In some embodiments, the sample data acquisition module is further configured to: enable the SWEM model to re-represent the sample data set as different categories of sample data by using the new classification rule; determine again, among all categories, a first target category with the minimum first measure index and a second target category having the minimum second measure index with the first target category, wherein the first measure index is used to measure the separability of sample data within a category and the second measure index is used to measure the separability of sample data between categories; determine again the target sample data overlapping between the first target category and the second target category; and modify the category to which the target sample data belongs in combination with the new classification rule until the data set measure index meets the preset requirement, thereby forming the final classification rule.
In some embodiments, the sample data acquisition module is further configured to: divide the sample data set into a plurality of short texts; perform word segmentation processing on the short texts to obtain a plurality of words; represent each word as a word vector; and input the sample data set into the SWEM model in the form of word vectors to obtain different categories of sample data, wherein the sample data are dense vectors output by the SWEM model.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (11)

1. A classification rule obtaining method is characterized by comprising the following steps:
representing a sample data set as different categories of sample data by using a SWEM (simple word-embedding-based model) model, wherein the SWEM model has a preset classification rule;
determining, among all categories, a first target category with a minimum first measure index and a second target category having a minimum second measure index with the first target category, wherein the first measure index is used to measure separability of sample data within a category, and the second measure index is used to measure separability of sample data between categories;
determining target sample data that overlap between the first target category and the second target category;
and modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule.
2. The method according to claim 1, wherein the step of determining a first target category with a smallest first measure index among all categories and a second target category with a smallest second measure index with the first target category comprises:
calculating a second measure index between every two categories respectively;
calculating the first measure index of each category by using the second measure indexes related to that category;
determining the first target category with the minimum first measure index among all categories; and
determining the second target category corresponding to the smallest second measure index related to the first target category among all categories.
3. The method of claim 2, wherein the second measure index between two categories is calculated according to the following formula:

S_ij = B_ij / W_i

wherein S_ij represents the second measure index between category i and category j, B_ij represents the inter-class distance between category i and category j, and W_i represents the intra-class distance of category i.
4. The method of claim 3, wherein the inter-class distance B_ij between category i and category j is calculated according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

wherein c_i represents the mean vector of category i and c_j represents the mean vector of category j.
5. The method of claim 3, wherein the intra-class distance W_i is calculated according to the following formula:

W_i = (1/n_i) Σ_{k=1}^{n_i} (x_k - c_i)(x_k - c_i)^T

wherein x_k represents the kth sample data in category i, c_i represents the mean vector of category i, and n_i represents the number of sample data in category i.
6. The method according to claim 3, wherein the first measure index of each category is calculated according to the following formula:

S̄_i = (1/N) Σ_{j=1}^{N} n_j · S_ij

wherein S̄_i represents the first measure index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.
7. The method of claim 2, further comprising, after calculating the first measure index of each category using the second measure indexes associated with each category: calculating the data set measure index of the entire sample data set.
8. The method of claim 7, wherein the data set measure index is calculated according to the following formula:

S = (1/N) Σ_{i=1}^{N} S̄_i

wherein S represents the data set measure index, S̄_i represents the first measure index of category i, and N represents the number of categories.
9. The method according to claim 7 or claim 8, further comprising, after modifying the category to which the target sample data belongs in combination with the preset classification rule to form a new classification rule:
enabling the SWEM model to re-represent the sample data set as different categories of sample data by using the new classification rule;
determining again, among all categories, a first target category with the minimum first measure index and a second target category having the minimum second measure index with the first target category, wherein the first measure index is used to measure separability of sample data within a category, and the second measure index is used to measure separability of sample data between categories;
determining again target sample data overlapping between the first target category and the second target category; and
modifying the category to which the target sample data belongs in combination with the new classification rule until the data set measure index meets a preset requirement, thereby forming a final classification rule.
10. The method of claim 1, wherein the step of representing the sample data sets into different classes of sample data using a SWEM model comprises:
dividing the sample data set into a plurality of short texts;
performing word segmentation processing on the short text to obtain a plurality of words;
representing each word as a word vector;
and inputting the sample data set into the SWEM model in the form of word vectors to obtain different categories of sample data, wherein the sample data are dense vectors output by the SWEM model.
11. A classification rule acquisition apparatus, comprising:
a sample data acquisition module, configured to represent a sample data set as different categories of sample data by using a SWEM (simple word-embedding-based model) model, wherein the SWEM model has a preset classification rule;
a classification measure module, configured to determine, among all categories, a first target category with a minimum first measure index and a second target category having a minimum second measure index with the first target category, wherein the first measure index is used to measure separability of sample data within a category, and the second measure index is used to measure separability of sample data between categories; and to determine target sample data overlapping between the first target category and the second target category; and
a category modification module, configured to modify, in combination with the preset classification rule, the category to which the target sample data belongs, so as to form a new classification rule.
CN202010537532.4A 2020-06-12 2020-06-12 Classification rule obtaining method and device Active CN111783995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537532.4A CN111783995B (en) 2020-06-12 2020-06-12 Classification rule obtaining method and device

Publications (2)

Publication Number Publication Date
CN111783995A true CN111783995A (en) 2020-10-16
CN111783995B CN111783995B (en) 2022-11-29

Family

ID=72756388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537532.4A Active CN111783995B (en) 2020-06-12 2020-06-12 Classification rule obtaining method and device

Country Status (1)

Country Link
CN (1) CN111783995B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908055A (en) * 2010-03-05 2010-12-08 黑龙江工程学院 Method for setting information classification threshold for optimizing lam percentage and information filtering system using same
CN106528771A (en) * 2016-11-07 2017-03-22 中山大学 Fast structural SVM text classification optimization algorithm
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109471944A (en) * 2018-11-12 2019-03-15 中山大学 Training method, device and the readable storage medium storing program for executing of textual classification model
CN110069627A (en) * 2017-11-20 2019-07-30 中国移动通信集团上海有限公司 Classification method, device, electronic equipment and the storage medium of short text
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110443281A (en) * 2019-07-05 2019-11-12 重庆信科设计有限公司 Adaptive oversampler method based on HDBSCAN cluster
CN111143569A (en) * 2019-12-31 2020-05-12 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium


Also Published As

Publication number Publication date
CN111783995B (en) 2022-11-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant