CN111783995A - Classification rule obtaining method and device - Google Patents


Info

Publication number
CN111783995A
Authority
CN
China
Prior art keywords
sample data
index
class
target
category
Prior art date
Legal status
Granted
Application number
CN202010537532.4A
Other languages
Chinese (zh)
Other versions
CN111783995B (en)
Inventor
王聪
沈承恩
杨善松
Current Assignee
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd
Priority to CN202010537532.4A
Publication of CN111783995A
Application granted
Publication of CN111783995B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

According to the classification rule obtaining method and device, data classified by the SWEM model can be used as sample data, and the target categories with the smallest first measurement index and the smallest second measurement index among all categories are determined respectively. A minimal first index indicates that the separability of the data within the target category is poor, and a minimal second index indicates that the separability between the two target categories corresponding to the second index is poor. Then, the coincident target sample data in the two target categories are determined, and the category of the target sample data is modified so that it is clearly distinguished from the other categories, forming a new classification rule that subsumes the preset classification rule. With this technical scheme, the target sample data whose category is to be modified can be determined from the measurement indexes, a more specific and accurate classification rule is formed, the method can be applied to data sets iterated over multiple versions, and the application range is wide.

Description

Classification rule obtaining method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for obtaining classification rules.
Background
With the rapid development of artificial intelligence, machine learning and deep learning are widely used in classification tasks, especially in natural language processing tasks, such as: user intent identification, spam identification, and the like. With the development of deep learning, there are various classification models based on deep learning, such as: textCNN model, Transformer model, BERT model, etc.
Processing various classification tasks based on a classification model is currently the main method of data classification. The current data classification work flow mainly comprises: first, several classification standards are established manually according to the service type or prior knowledge; the data set is then labeled by category according to each classification standard in turn; next, the data set is classified by a deep-learning-based classification model; the manually labeled categories are verified one by one against the machine classification results; and the classification standard of any data set whose classification result is unsatisfactory is modified.
However, in this data classification method, technicians design the various classification standards from personal experience, on the premise of meeting the business requirements; without knowing which classification standard is more reasonable, they can only feed the data under every classification standard into the classification model and judge by the final machine classification results. It can be seen that, in such a data classification manner, a classification standard designed subjectively by a technician is not applicable to different versions of a data set.
Disclosure of Invention
The application provides a classification rule obtaining method and apparatus, aiming to solve the problem that classification standards in current data classification methods have a narrow application range.
In a first aspect, the present application provides a classification rule obtaining method, including:
representing the sample data set as sample data of different categories by using a SWEM (Simple Word-Embedding-based Model), wherein the SWEM model has a preset classification rule;
determining, among all the categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories;
determining the target sample data that coincide between the first target category and the second target category;
and modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule.
In some embodiments of the present application, the step of determining, among all the categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category includes:
respectively calculating the second measurement index between every two categories;
calculating the first measurement index of each category by using the second measurement indexes related to each category;
determining, among all categories, the first target category with the smallest first measurement index;
determining, among all categories, the second target category corresponding to the smallest second measurement index related to the first target category.
In some embodiments of the present application, the second measurement index between two categories is calculated according to the following formula:

S_ij = B_ij / W_i

where S_ij represents the second measurement index between class i and class j, B_ij denotes the inter-class distance between class i and class j, and W_i denotes the intra-class distance of class i.
In some embodiments of the present application, the inter-class distance B_ij between class i and class j is calculated according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

where c_i represents the mean vector of class i, and c_j represents the mean vector of class j.
In some embodiments of the present application, the intra-class distance W_i is calculated according to the following formula:

W_i = Σ_k (x_k - c_i)(x_k - c_i)^T

where x_k represents the k-th sample datum in class i, and c_i represents the mean vector of class i.
In some embodiments of the present application, the first measurement index of each category is calculated according to the following formula:

S_i = ( Σ_{j=1, j≠i}^{N} n_j · S_ij ) / ( Σ_{j=1, j≠i}^{N} n_j )

where S_i represents the first measurement index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.
In some embodiments of the present application, after calculating the first measurement index of each category by using the second measurement indexes related to each category, the method further includes: calculating the data set measurement index of the entire sample data set.
In some embodiments of the present application, the data set measurement index is calculated according to the following formula:

S = (1/N) · Σ_{i=1}^{N} S_i

where S represents the data set measurement index, S_i represents the first measurement index of category i, and N represents the number of categories.
In some embodiments of the present application, after modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule, the method further includes:
enabling the SWEM model to represent the sample data set as sample data of different categories again by using the new classification rule;
determining again, among all categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories;
determining again the target sample data that coincide between the first target category and the second target category;
and modifying the category to which the target sample data belongs in combination with the new classification rule, until the data set measurement index meets the preset requirement, thereby forming a final classification rule.
In some embodiments of the present application, the step of representing the sample data set into different classes of sample data using the SWEM model comprises:
dividing the sample data set into a plurality of short texts;
performing word segmentation processing on the short text to obtain a plurality of words;
representing each word as a word vector;
and inputting the sample data set in word-vector form into the SWEM model to obtain sample data of different categories, wherein the sample data are dense vectors output by the SWEM model.
In a second aspect, the present application further provides a classification rule obtaining apparatus, including:
a sample data acquisition module, configured to represent the sample data set as sample data of different categories by using a SWEM (Simple Word-Embedding-based Model), wherein the SWEM model has a preset classification rule;
a classification measurement module, configured to determine, among all categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories; and to determine the target sample data that coincide between the first target category and the second target category;
and a category modification module, configured to modify the category to which the target sample data belongs in combination with the preset classification rule to form a new classification rule.
As can be seen from the above, with the classification rule obtaining method and device of the technical scheme of the application, data classified by the SWEM model can be used as sample data, and the target categories with the smallest first measurement index and the smallest second measurement index among all categories are determined respectively. A minimal first index indicates that the separability of the data within the target category is poor, and a minimal second index indicates that the separability between the two target categories corresponding to the second index is poor. Then, the coincident target sample data in the two target categories are determined, and the category of the target sample data is modified so that it is clearly distinguished from the other categories, forming a new classification rule that subsumes the preset classification rule. With this technical scheme, the target sample data whose category is to be modified can be determined from the measurement indexes, a more specific and accurate classification rule is formed, the method can be applied to data sets iterated over multiple versions, and the application range is wide.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments will be briefly described below; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of a discrete vectorization representation method according to an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a dense vectorization representation method according to an embodiment of the present application;
fig. 3 is a flowchart illustrating a classification rule obtaining method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the processing results of a SWEM model according to an embodiment of the present application;
fig. 5 is a flowchart illustrating another classification rule obtaining method according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a classification rule obtaining apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, embodiments, and advantages of the present application clearer, exemplary embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It should be understood that the described exemplary embodiments are only a part of the embodiments of the present application, not all of them.
All other embodiments obtained by a person of ordinary skill in the art based on the exemplary embodiments described herein without creative effort shall fall within the scope of the appended claims. In addition, while the disclosure herein has been presented in terms of one or more exemplary examples, it should be appreciated that each aspect of the disclosure may also separately constitute a complete embodiment.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above drawings are used for distinguishing between similar or analogous objects or entities, and are not necessarily intended to limit a particular order or sequence, unless otherwise indicated. It should be understood that the terms so used are interchangeable under appropriate circumstances, such that the embodiments described herein can, for example, be practiced in sequences other than those illustrated or otherwise described herein.
Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or device that comprises a list of elements is not necessarily limited to those elements explicitly listed, but may include other elements not expressly listed or inherent to such product or device.
The term "module," as used herein, refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
Currently, with the rapid development of artificial intelligence technology, machine learning and deep learning are widely applied in data classification tasks, especially in natural language processing tasks, such as user intent identification and spam identification. As deep learning has evolved, a variety of deep-learning-based classification models have emerged, such as the TextCNN model, the Transformer model, the BERT model, the SWEM model, and the like.
The current data classification work flow mainly comprises: first, several classification standards are established manually according to the service type or prior knowledge; the data set is then labeled by category according to each classification standard in turn; next, the data set is classified by a deep-learning-based classification model; the manually labeled categories are verified one by one against the machine classification results; and the classification standard of any data set whose classification result is unsatisfactory is modified.
However, in this data classification method, technicians design the various classification standards from personal experience, on the premise of meeting the business requirements; without knowing which classification standard is more reasonable, they can only feed the data under every classification standard into the classification model and judge by the final machine classification results. It can be seen that, in such a data classification manner, a classification standard designed subjectively by a technician is not applicable to different versions of a data set. In addition, technicians design multiple classification standards for the same data set and need to verify the classification effect of each standard one by one, which is labor-intensive and time-consuming.
Therefore, in view of the above, embodiments of the present application provide a classification rule obtaining method and apparatus, which can determine a unified classification rule through measurement indexes and can be applied to multi-version data sets, so that designers no longer need to manually design multiple classification standards, thereby saving time.
In an embodiment of the present application, the sample data set may be a text data set. In practical cases, there may be two vectorized representations of a text data set, one discrete representation and one dense representation.
Fig. 1 is a schematic diagram of the discrete vectorized representation method according to an embodiment of the present application. In the discrete representation method, feature extraction is mainly performed, based on artificially designed features, on the short texts converted from the text data set. Text features are typically based on part of speech, word labels, characters, word frequency, and so on, as well as various hybrid text features, such as: tag + tag, tag + word, and word + word. Different feature extractors extract features from the short text separately: the corresponding position of the feature vector is set to 1 when the feature is satisfied and 0 when it is not, and finally the whole text data set can be represented as a 1×N-dimensional 0-1 feature vector. However, in this representation method, the design of the features depends mainly on the personal work experience of the technician and his sensitivity to the data, and is highly subjective; when the text data set changes, the features need to be redesigned, and the workload on the technician is too large, so this representation method is not often used in practice.
Fig. 2 is a schematic diagram of the dense vectorized representation method according to an embodiment of the present application. In the dense representation method, each word in a short text is represented as a word vector by using a word2vec model; the word vectors are then used as input, the representation of each sentence is calculated by a SWEM model, and a dense vector is output. This representation method requires no manual feature design, greatly reduces the workload of technicians, and does not depend on a specific classification standard.
Therefore, the classification rule obtaining method provided in the embodiments of the present application relies on the SWEM model to obtain the classified sample data of the data set.
Fig. 3 is a flowchart of a classification rule obtaining method according to an embodiment of the present application, and as shown in fig. 3, the method specifically includes:
step S101, the sample data set is expressed into sample data of different types by using a SWEM model, and the SWEM model has preset classification rules.
There are 4 methods of representing word vectors in the SWEM model: SWEM-aver (average pooling), SWEM-max (max pooling), SWEM-concat (concatenation), and SWEM-hier (hierarchical pooling).
SWEM-aver denotes average pooling, i.e., averaging the word vectors element-wise; this approach takes the information of every word into account. SWEM-max denotes max pooling, taking the maximum value in each dimension of the word vectors; this approach keeps the most significant feature information, while other irrelevant or unimportant information is ignored. SWEM-concat denotes concatenation: considering that the information captured by the two pooling methods above is complementary, this variant concatenates the results of the two pooling methods. SWEM-hier denotes hierarchical pooling: since the above methods do not consider word order or spatial information, this variant first applies average pooling over a local window of size n, and then applies global max pooling to the window results.
In practical use, in order to simplify the calculation and extract as much information as possible, SWEM-concat is often used as a sentence representation.
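The four pooling variants above operate only on the word-vector matrix of a sentence, so they can be sketched in a few lines of NumPy. The function names and the window handling in SWEM-hier are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def swem_aver(word_vecs):
    # Average pooling: element-wise mean over all word vectors.
    return word_vecs.mean(axis=0)

def swem_max(word_vecs):
    # Max pooling: element-wise maximum over all word vectors.
    return word_vecs.max(axis=0)

def swem_concat(word_vecs):
    # Concatenation: join the complementary average- and max-pooled results.
    return np.concatenate([swem_aver(word_vecs), swem_max(word_vecs)])

def swem_hier(word_vecs, n=2):
    # Hierarchical pooling: average pooling over local windows of size n,
    # then global max pooling over the window averages.
    num_words = len(word_vecs)
    windows = [word_vecs[k:k + n].mean(axis=0)
               for k in range(max(1, num_words - n + 1))]
    return np.stack(windows).max(axis=0)
```

For a 3-word sentence with 2-dimensional word vectors, swem_concat yields the 4-dimensional concatenation of the average- and max-pooled vectors, which is why it is often preferred as the sentence representation.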
In addition, before the SWEM model is used, some training samples are used for learning and training so that the SWEM model can classify the sample data set according to a specific classification rule. The specific classification rule mentioned here is the preset classification rule mentioned in the embodiments of the present application.
Step S102, a first target category with the smallest first measurement index, and a second target category with the smallest second measurement index with respect to the first target category, are determined among all categories, wherein the first measurement index is used for measuring the separability of sample data within a category, and the second measurement index is used for measuring the separability of sample data between categories.
Fig. 4 is a schematic diagram of a processing result of the SWEM model according to an embodiment of the present application. As shown in Fig. 4, a sample data set may be divided into 3 categories after being processed by the SWEM model: category 1 contains 3 sample data, category 2 contains 4 sample data, and category 3 contains 2 sample data. In this embodiment, the categories produced by the SWEM model, and the sample data under them, are what is processed.
In this embodiment, the classification of the sample data set is based on a clustering algorithm, which may partition a data set into different classes or clusters according to a specific criterion (such as distance), so that the similarity of data objects within the same cluster is as large as possible while the difference between data objects in different clusters is as large as possible. Clustering performance is measured by internal measurement indexes and external measurement indexes. Internal measurement indexes include, for example, the Silhouette Coefficient, the CH index (Calinski-Harabasz index), and the Davies-Bouldin index; external measurement indexes include, for example, the Jaccard coefficient, the FM index (Fowlkes-Mallows index), the Rand index, the DB index (Davies-Bouldin index), and the Dunn index.
In this embodiment, the CH index is used as the measurement index to screen the target categories. The CH index also measures the separability of a category; compared with other measurement indexes, its principle is simpler and easier to understand, it is highly practical, and it is fast to compute. The larger the CH index, the better the separability, and the less likely the category is to coincide with other categories or other data.
Step S103, determining target sample data which are overlapped in the first target category and the second target category.
In this embodiment, the first target category and the second target category may both be understood as two target categories whose internal sample data have a relatively high coincidence ratio, and they can be screened out through calculation of the measurement indexes. Further, sample data with a high degree of coincidence between the first target category and the second target category may be used as the target sample data. Here, the degree of coincidence may be compared against a specific threshold: for example, the similarity between sample data in the first target category and sample data in the second target category is calculated, and sample data whose similarity is greater than or equal to the threshold are determined as the coincident target sample data.
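The threshold comparison just described can be sketched as follows. The use of cosine similarity, the 0.95 threshold, and the function name are illustrative assumptions, since the embodiment leaves the concrete similarity measure open:

```python
import numpy as np

def coincident_samples(class_a, class_b, threshold=0.95):
    # Return (k, m) index pairs whose cosine similarity reaches the
    # threshold; such pairs are treated as coincident target sample data.
    pairs = []
    for k, x in enumerate(class_a):
        for m, y in enumerate(class_b):
            sim = float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))
            if sim >= threshold:
                pairs.append((k, m))
    return pairs
```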
And step S104, modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule.
A high coincidence ratio of the target sample data indicates that it is easily classified into two or more different categories. For example, suppose the sample data set is classified into two categories, "beauty treatment" and "life service", according to the preset classification rule, but sample data 1 in the "life service" category and sample data 2 in the "beauty treatment" category are highly similar, and according to its specific content sample data 1 could also be classified into the "beauty treatment" category. Such a classification is obviously not standard enough, and when classification is performed with the preset classification rule, sample data 1 is difficult to assign accurately to a fixed category. Therefore, in this embodiment, after the first target category and the second target category are determined, the category of the highly coincident target sample data needs to be changed: either a category is established separately for the target sample data, or the target sample data is assigned to a certain fixed category. In this way the previous preset classification rule is refined, and a new classification rule is formed. Thereafter, the SWEM model may classify the data set according to the new classification rule.
In some embodiments, the step of determining, among all categories, a first target category with the smallest first measurement index and a second target category with the smallest second measurement index with respect to the first target category comprises:
step S201, respectively calculating a second weighing index between every two categories; also, in some embodiments, a second metric between two categories may be calculated according to the following formula:
Figure BDA0002537530700000071
wherein S isijRepresenting a second index of measure, B, between class i and class jijPresentation classInter-class distance, W, between class i and class jiIndicating the intra-class distance of class i.
For example, a second metric between class 1 and class 2 is calculated, then
Figure BDA0002537530700000072
Calculate a metric between class 2 and class 1, then
Figure BDA0002537530700000073
In some embodiments, the inter-class distance B_ij between class i and class j may be calculated according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

where c_i represents the mean vector of class i, and c_j represents the mean vector of class j.

For example, the inter-class distance between class 1 and class 2 is B_12 = (c_1 - c_2)(c_1 - c_2)^T, and the inter-class distance between class 2 and class 1 is B_21 = (c_2 - c_1)(c_2 - c_1)^T.
The larger the inter-class distance, the better the separability between the two classes.
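Treating each mean vector c_i as a row vector, the product (c_i - c_j)(c_i - c_j)^T is a scalar, the squared Euclidean distance between the two class means; a minimal sketch (function name assumed):

```python
import numpy as np

def inter_class_distance(samples_i, samples_j):
    # B_ij = (c_i - c_j)(c_i - c_j)^T: the squared Euclidean distance
    # between the mean vectors of class i and class j.
    c_i = np.mean(samples_i, axis=0)
    c_j = np.mean(samples_j, axis=0)
    diff = c_i - c_j
    return float(diff @ diff)
```

Note that B_ij = B_ji by construction, consistent with the example above.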
In some embodiments, the intra-class distance W_i may be calculated according to the following formula:

W_i = Σ_k (x_k - c_i)(x_k - c_i)^T

where x_k represents the k-th sample datum in class i, and c_i represents the mean vector of class i.

For example, the intra-class distance of class 1 is W_1 = Σ_k (x_k - c_1)(x_k - c_1)^T, and the intra-class distance of class 2 is W_2 = Σ_k (x_k - c_2)(x_k - c_2)^T.
The smaller the intra-class distance, the better the aggregation within the class.
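The intra-class distance and the second measurement index can then be sketched together, assuming the second index takes the ratio form S_ij = B_ij / W_i (a reconstruction; the asymmetry between S_12 and S_21 then follows from the denominator):

```python
import numpy as np

def intra_class_distance(samples_i):
    # W_i = sum over k of (x_k - c_i)(x_k - c_i)^T: the total squared
    # distance of the samples in class i from the class mean vector c_i.
    c_i = np.mean(samples_i, axis=0)
    return float(sum((x - c_i) @ (x - c_i) for x in np.asarray(samples_i)))

def second_index(samples_i, samples_j):
    # S_ij = B_ij / W_i (reconstructed ratio form): a large value means the
    # two classes are far apart relative to the spread of class i.
    c_i = np.mean(samples_i, axis=0)
    c_j = np.mean(samples_j, axis=0)
    b_ij = float((c_i - c_j) @ (c_i - c_j))
    return b_ij / intra_class_distance(samples_i)
```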
Step S202, calculating the first measurement index of each category by using the second measurement indexes related to each category.
For example, a sample data set is divided into 3 categories after passing through the SWEM model, and the second measurement indexes between every two categories are S_11, S_12, S_13, S_21, S_22, S_23, S_31, S_32, and S_33. The second measurement indexes related to category 1 are S_11, S_12, and S_13; the second measurement indexes related to category 2 are S_21, S_22, and S_23; and the second measurement indexes related to category 3 are S_31, S_32, and S_33.
Also, in some embodiments, the first measurement index of each category may be calculated according to the following formula:

S_i = ( Σ_{j=1, j≠i}^{N} n_j · S_ij ) / ( Σ_{j=1, j≠i}^{N} n_j )

where S_i represents the first measurement index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.

For example, if there are 3 categories, N is 3; if the number of sample data in each category is 3, the first measurement indexes of category 1, category 2, and category 3 reduce to S_1 = (S_12 + S_13)/2, S_2 = (S_21 + S_23)/2, and S_3 = (S_31 + S_32)/2.
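A sketch of this per-class aggregation, assuming the first index is the n_j-weighted average of the second indexes over the other categories (the exact weighting is reconstructed from the surrounding definitions, not taken verbatim from the patent):

```python
def first_index(i, S, counts):
    # First measurement index of class i: the second indexes S[i][j] over
    # the other classes j, averaged with weights n_j = counts[j]
    # (the sample count of class j). The weighting form is an assumption.
    N = len(counts)
    num = sum(counts[j] * S[i][j] for j in range(N) if j != i)
    den = sum(counts[j] for j in range(N) if j != i)
    return num / den
```

With equal class sizes the weights cancel, matching the simplified 3-category example above.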
step S203, determining a first target category with the minimum first scale index in all categories.
Step S204, determining a second target category corresponding to the minimum second weighing index related to the first target category in all categories.
For example, if the first index of all the categories is the smallest category 1, then the smallest of the second indexes associated with category 1 needs to be further determined, i.e., S is determined11、S12And S13Which is the smallest, if S12Then category 2 may be determined to be the second target category if S13Then category 3 may be determined to be the second target category.
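Steps S203 and S204 amount to two argmin searches, which can be sketched as follows (function name assumed):

```python
def select_target_categories(first_indexes, S):
    # Step S203: the first target category minimizes the first
    # measurement index over all categories.
    i_star = min(range(len(first_indexes)), key=lambda i: first_indexes[i])
    # Step S204: the second target category minimizes S[i_star][j]
    # over the remaining categories j != i_star.
    j_star = min((j for j in range(len(S)) if j != i_star),
                 key=lambda j: S[i_star][j])
    return i_star, j_star
```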
In some embodiments, after calculating the first metric for each category using the second metric associated with each category, the method further comprises: and calculating the data set weighing index of the whole sample data set. Also, in some embodiments, the dataset weighting index may be calculated according to the following formula:
S = (1/N) Σ_{i=1}^{N} S̄_i

wherein S represents the data set measure index, S̄_i represents the first measure index of category i, and N represents the number of categories.
As described above, if there are 3 categories, then S = (S̄_1 + S̄_2 + S̄_3)/3.
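The data set measure index of this embodiment, taken as the average of the per-category first measure indexes per the formula above, together with the threshold check, can be sketched as follows (the threshold value is illustrative):

```python
def dataset_measure_index(first_indexes):
    """S = (1/N) * sum_i S_bar_i: average of the first measure indexes."""
    return sum(first_indexes) / len(first_indexes)

def meets_preset_requirement(first_indexes, threshold):
    """True when the data set measure index reaches the preset threshold."""
    return dataset_measure_index(first_indexes) >= threshold
```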
The data set measure index in this embodiment may be used to measure the separability of the entire sample data set. The larger S is, the better the separability of the data set, and the less likely it is that sample data from different categories overlap after classification. When the value of S reaches a certain threshold, the currently used classification rule may be considered to have achieved an ideal effect. If the value of S calculated after classifying with the current classification rule does not reach the preset threshold, the category to which the target sample data belongs needs to be modified to form a new classification rule, which is then applied again. Therefore, in some embodiments, after modifying the category to which the target sample data belongs in combination with the preset classification rule to form a new classification rule, the method may further include the following steps:
Step S301, the SWEM model re-represents the sample data set as different categories of sample data by using the new classification rule;
Step S302, a first target category with the minimum first measure index, and a second target category having the minimum second measure index with the first target category, are determined again among all categories, wherein the first measure index is used to measure the separability of sample data within a category, and the second measure index is used to measure the separability of sample data between categories;
Step S303, target sample data overlapping between the first target category and the second target category are determined again;
Step S304, the category to which the target sample data belongs is modified in combination with the new classification rule until the data set measure index meets the preset requirement, thereby forming a final classification rule.
The preset requirement in this embodiment may refer to a preset threshold or the like. If the data set measure index is greater than or equal to the preset threshold, the new classification rule obtained in the current acquisition process has met the requirement, or the classification result is close to an ideal state; at this time, the next acquisition process may be terminated, and the classification rule of the current process is taken as the final classification rule.
Combined with steps S101 to S104, the process of steps S301 to S304 may be understood as looping steps S101 to S104 whenever the data set measure index, calculated after step S102, does not meet the preset requirement. The specific loop flow is shown in fig. 5, which is a flowchart of another classification rule obtaining method according to an embodiment of the present application. The method further includes step S1021, calculating the data set measure index; and step S1022, judging whether the data set measure index meets the preset requirement: if it does, the acquisition process of the classification rule ends; if not, step S101 is repeated, that is, the sample data set is classified by using the new classification rule.
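The loop of fig. 5 can be sketched as follows; `classify` and `relabel` are hypothetical hooks standing in for step S101 (SWEM classification plus index computation) and steps S103-S104 (modifying the category of the overlapping target samples), not APIs from the patent:

```python
def refine_classification_rule(classify, relabel, threshold, max_rounds=20):
    """Repeat steps S101-S104 until the data set measure index S meets the
    preset requirement (step S1022) or a round limit is hit.

    classify(rule) -> (categories, S_matrix, first_indexes)
    relabel(categories, S_matrix, first_indexes) -> new classification rule
    """
    rule, s = None, float("-inf")
    for _ in range(max_rounds):
        categories, S_matrix, first_indexes = classify(rule)  # step S101
        s = sum(first_indexes) / len(first_indexes)           # step S1021
        if s >= threshold:                                    # step S1022
            break
        rule = relabel(categories, S_matrix, first_indexes)   # steps S103-S104
    return rule, s
```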
In some embodiments, the step of representing the sample data set into different classes of sample data using the SWEM model comprises:
step S401, the sample data set is divided into a plurality of short texts.
As can be seen from the above, the sample data set may be a text data set. Before the classification rule obtaining method of the embodiment of the present application is performed, the text data set needs to be divided into a plurality of short texts; the specific division manner includes, but is not limited to, dividing the text at commas, periods, and other symbols.
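A minimal sketch of step S401, splitting on comma, period and their full-width counterparts. The exact delimiter set is an assumption; the text above only says "comma, period, and other symbols":

```python
import re

def split_into_short_texts(text):
    """Divide a text data set into short texts at punctuation boundaries."""
    parts = re.split(r"[,.，。;；!！?？]", text)
    return [p.strip() for p in parts if p.strip()]  # drop empty fragments
```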
Step S402, performing word segmentation processing on the short text to obtain a plurality of words.
For example, if a short text is "I want to watch a movie", word segmentation may be performed on it to obtain three words: "I", "want to watch", and "movie".
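Word segmentation itself is typically delegated to a trained Chinese segmenter such as jieba; purely for illustration, a greedy forward-maximum-matching sketch over a toy (hypothetical) vocabulary reproduces the segmentation of the example:

```python
def segment(text, vocab):
    """Greedy forward maximum matching: at each position take the longest
    vocabulary entry, falling back to a single character."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens
```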
In step S403, each word is represented as a word vector.
Still taking "I want to watch a movie" as an example, the three words "I", "want to watch", and "movie" may each be represented as a 300-dimensional word vector, and the word vectors corresponding to the three words are then added and spliced to form the word vector corresponding to the short text.
And S404, inputting the sample data set into the SWEM model in a word vector mode to obtain sample data of different types, wherein the sample data are dense vectors output by the SWEM model.
In this embodiment, the word vectors input into the SWEM model are the word vectors corresponding to the short texts; after being processed by the SWEM model, they are directly classified according to the existing classification rules of the SWEM model, so as to obtain the different categories of sample data.
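SWEM (simple word-embedding-based model) builds the short-text representation by parameter-free pooling of its word vectors; a minimal average-pooling sketch, where the 4-dimensional toy embeddings in the usage below stand in for the 300-dimensional vectors of the example:

```python
import numpy as np

def swem_average(words, embeddings, dim=300):
    """Average-pool the word vectors of a short text into one dense vector.
    Out-of-vocabulary words fall back to a zero vector."""
    vectors = [embeddings.get(w, np.zeros(dim)) for w in words]
    return np.mean(vectors, axis=0)
```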
According to the above scheme, the classification rule obtaining method provided by the embodiment of the present application may take data classified by the SWEM model as sample data and respectively determine the target categories with the minimum first measure index and the minimum second measure index among all categories. A minimum first measure index indicates poor separability within the target category; a minimum second measure index indicates poor separability between the two target categories to which it corresponds. Then, the overlapping target sample data in the two target categories are determined, and their category is modified so that the target sample data are clearly distinguished from other categories, forming a new classification rule that contains the preset classification rule. With this technical solution, the target sample data whose category is to be modified can be determined according to the measure indexes, a more specific and accurate classification rule is formed, and the method can be applied to multi-version iterative data sets, giving it a wide application range.
Fig. 6 is a schematic structural diagram of a classification rule obtaining apparatus according to an embodiment of the present application. As shown in fig. 6, an apparatus for obtaining a classification rule provided in an embodiment of the present application includes:
The sample data acquisition module 61 is configured to represent a sample data set as different categories of sample data by using a SWEM model, where the SWEM model has a preset classification rule. The class measurement module 62 is configured to determine, among all categories, a first target category with a minimum first measure index and a second target category having a minimum second measure index with the first target category, where the first measure index is used to measure the separability of sample data within a category and the second measure index is used to measure the separability of sample data between categories; and to determine target sample data overlapping between the first target category and the second target category. The category modification module 63 is configured to modify, in combination with the preset classification rule, the category to which the target sample data belongs, so as to form a new classification rule.
In some embodiments, the classification metric module is further to: calculate a second measure index between every two categories; calculate the first measure index of each category by using the second measure indexes related to that category; determine the first target category with the minimum first measure index among all categories; and determine the second target category corresponding to the minimum second measure index related to the first target category among all categories.
In some embodiments, the classification metric module is further to: a second scale index between two categories is calculated according to the following formula:
S_ij = B_ij / W_i

wherein S_ij represents the second measure index between category i and category j, B_ij represents the inter-class distance between category i and category j, and W_i represents the intra-class distance of category i.
In some embodiments, the classification metric module is further to: calculate the inter-class distance B_ij between category i and category j according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

wherein c_i represents the mean vector of category i and c_j represents the mean vector of category j.
In some embodiments, the classification metric module is further to: calculate the intra-class distance W_i according to the following formula:

W_i = (1/n_i) Σ_{k=1}^{n_i} (x_k - c_i)(x_k - c_i)^T

wherein x_k represents the kth sample data in category i, c_i represents the mean vector of category i, and n_i represents the number of sample data in category i.
In some embodiments, the classification metric module is further to: calculate the first measure index of each category according to the following formula:

S̄_i = (1/N) Σ_{j=1}^{N} n_j · S_ij

wherein S̄_i represents the first measure index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.
In some embodiments, the classification metric module is further to: after the first measure index of each category is calculated using the second measure indexes associated with each category, calculate the data set measure index of the entire sample data set.
In some embodiments, the classification metric module is further to: calculate the data set measure index according to the following formula:

S = (1/N) Σ_{i=1}^{N} S̄_i

wherein S represents the data set measure index, S̄_i represents the first measure index of category i, and N represents the number of categories.
In some embodiments, the sample data acquisition module is further configured to: enable the SWEM model to re-represent the sample data set as different categories of sample data by using the new classification rule; determine again, among all categories, a first target category with the minimum first measure index and a second target category having the minimum second measure index with the first target category, wherein the first measure index is used to measure the separability of sample data within a category and the second measure index is used to measure the separability of sample data between categories; determine again the target sample data overlapping between the first target category and the second target category; and modify the category to which the target sample data belongs in combination with the new classification rule until the data set measure index meets the preset requirement, thereby forming the final classification rule.
In some embodiments, the sample data acquisition module is further configured to: divide the sample data set into a plurality of short texts; perform word segmentation processing on the short texts to obtain a plurality of words; represent each word as a word vector; and input the sample data set into the SWEM model in the form of word vectors to obtain different categories of sample data, wherein the sample data are dense vectors output by the SWEM model.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (11)

1. A classification rule obtaining method is characterized by comprising the following steps:
representing a sample data set as different categories of sample data by using a SWEM (simple word-embedding-based model) model, wherein the SWEM model has a preset classification rule;
determining, among all categories, a first target category with a minimum first measure index and a second target category having a minimum second measure index with the first target category, wherein the first measure index is used to measure separability of sample data within a category, and the second measure index is used to measure separability of sample data between categories;
determining target sample data that overlap between the first target category and the second target category;
and modifying the class to which the target sample data belongs by combining the preset classification rule to form a new classification rule.
2. The method according to claim 1, wherein the step of determining a first target category with a smallest first measure index among all categories and a second target category with a smallest second measure index with the first target category comprises:
calculating a second measure index between every two categories respectively;
calculating the first measure index of each category by using the second measure indexes related to that category;
determining the first target category with the minimum first measure index among all categories; and
determining the second target category corresponding to the smallest second measure index related to the first target category among all categories.
3. The method of claim 2, wherein the second measure index between two categories is calculated according to the following formula:

S_ij = B_ij / W_i

wherein S_ij represents the second measure index between category i and category j, B_ij represents the inter-class distance between category i and category j, and W_i represents the intra-class distance of category i.
4. The method of claim 3, wherein the inter-class distance B_ij between category i and category j is calculated according to the following formula:

B_ij = (c_i - c_j)(c_i - c_j)^T

wherein c_i represents the mean vector of category i and c_j represents the mean vector of category j.
5. The method of claim 3, wherein the intra-class distance W_i is calculated according to the following formula:

W_i = (1/n_i) Σ_{k=1}^{n_i} (x_k - c_i)(x_k - c_i)^T

wherein x_k represents the kth sample data in category i, c_i represents the mean vector of category i, and n_i represents the number of sample data in category i.
6. The method according to claim 3, wherein the first measure index of each category is calculated according to the following formula:

S̄_i = (1/N) Σ_{j=1}^{N} n_j · S_ij

wherein S̄_i represents the first measure index of category i, N represents the number of categories, and n_j represents the number of sample data in category j.
7. The method of claim 2, further comprising, after calculating the first measure index of each category using the second measure indexes associated with each category: calculating the data set measure index of the entire sample data set.
8. The method of claim 7, wherein the data set measure index is calculated according to the following formula:

S = (1/N) Σ_{i=1}^{N} S̄_i

wherein S represents the data set measure index, S̄_i represents the first measure index of category i, and N represents the number of categories.
9. The method according to claim 7 or claim 8, further comprising, after modifying the category to which the target sample data belongs in combination with the preset classification rule to form a new classification rule:
enabling the SWEM model to re-represent the sample data set as different categories of sample data by using the new classification rule;
determining again, among all categories, a first target category with the minimum first measure index and a second target category having the minimum second measure index with the first target category, wherein the first measure index is used to measure separability of sample data within a category, and the second measure index is used to measure separability of sample data between categories;
determining again target sample data overlapping between the first target category and the second target category; and
modifying the category to which the target sample data belongs in combination with the new classification rule until the data set measure index meets a preset requirement, thereby forming a final classification rule.
10. The method of claim 1, wherein the step of representing the sample data sets into different classes of sample data using a SWEM model comprises:
dividing the sample data set into a plurality of short texts;
performing word segmentation processing on the short text to obtain a plurality of words;
representing each word as a word vector;
and inputting the sample data set into the SWEM model in the form of word vectors to obtain different categories of sample data, wherein the sample data are dense vectors output by the SWEM model.
11. A classification rule acquisition apparatus, comprising:
a sample data acquisition module, configured to represent a sample data set as different categories of sample data by using a SWEM (simple word-embedding-based model) model, wherein the SWEM model has a preset classification rule;
a classification measure module, configured to determine, among all categories, a first target category with a minimum first measure index and a second target category having a minimum second measure index with the first target category, wherein the first measure index is used to measure separability of sample data within a category, and the second measure index is used to measure separability of sample data between categories; and to determine target sample data overlapping between the first target category and the second target category; and
a category modification module, configured to modify, in combination with the preset classification rule, the category to which the target sample data belongs, so as to form a new classification rule.
CN202010537532.4A 2020-06-12 2020-06-12 Classification rule obtaining method and device Active CN111783995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537532.4A CN111783995B (en) 2020-06-12 2020-06-12 Classification rule obtaining method and device

Publications (2)

Publication Number Publication Date
CN111783995A true CN111783995A (en) 2020-10-16
CN111783995B CN111783995B (en) 2022-11-29

Family

ID=72756388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537532.4A Active CN111783995B (en) 2020-06-12 2020-06-12 Classification rule obtaining method and device

Country Status (1)

Country Link
CN (1) CN111783995B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101908055A (en) * 2010-03-05 2010-12-08 黑龙江工程学院 Method for setting information classification threshold for optimizing lam percentage and information filtering system using same
CN106528771A (en) * 2016-11-07 2017-03-22 中山大学 Fast structural SVM text classification optimization algorithm
CN108628971A (en) * 2018-04-24 2018-10-09 深圳前海微众银行股份有限公司 File classification method, text classifier and the storage medium of imbalanced data sets
CN109471944A (en) * 2018-11-12 2019-03-15 中山大学 Training method, device and the readable storage medium storing program for executing of textual classification model
CN110069627A (en) * 2017-11-20 2019-07-30 中国移动通信集团上海有限公司 Classification method, device, electronic equipment and the storage medium of short text
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 A kind of barrage file classification method, device, equipment and storage medium
CN110443281A (en) * 2019-07-05 2019-11-12 重庆信科设计有限公司 Adaptive oversampler method based on HDBSCAN cluster
CN111143569A (en) * 2019-12-31 2020-05-12 腾讯科技(深圳)有限公司 Data processing method and device and computer readable storage medium


Also Published As

Publication number Publication date
CN111783995B (en) 2022-11-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant