CN111931829B - Classifier screening method, system, storage medium and computer equipment - Google Patents

Classifier screening method, system, storage medium and computer equipment

Info

Publication number
CN111931829B
CN111931829B (application CN202010722712.XA)
Authority
CN
China
Prior art keywords
classifier
auc
data set
dynamic programming
expression
Prior art date
Legal status
Active
Application number
CN202010722712.XA
Other languages
Chinese (zh)
Other versions
CN111931829A (en)
Inventor
陈泽鹏
徐维超
陈昌润
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202010722712.XA
Publication of CN111931829A
Application granted
Publication of CN111931829B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation


Abstract

The invention provides a classifier screening method, system, storage medium, and computer device. The method comprises the following steps: preprocessing a data set and converting its text into a vectorization matrix; obtaining a variance expression of the classifier's AUC under the corresponding classification problem from the vectorization matrix; reconstructing the variance expression with dynamic programming terms to obtain a dynamic-programming-based AUC expression; and obtaining an AUC sample estimate for each candidate classifier from the dynamic-programming-based AUC expression, taking the candidate classifier with the largest estimate as the optimal classifier for the corresponding classification problem.

Description

Classifier screening method, system, storage medium and computer equipment
Technical Field
The present invention relates to the field of machine learning technologies, and in particular to a classifier screening method, a classifier screening system, a storage medium, and a computer device.
Background
Machine learning is generally divided into two types: supervised learning and unsupervised learning. Classification is a supervised learning task: the categories in the target database are known, and the job is to assign each record to its corresponding category. The output of a classification problem is no longer a continuous value but a discrete value specifying the category to which a record belongs. For a probabilistic classifier, a corresponding discrete (two-class) classifier can be obtained by setting a threshold: when the classifier's output exceeds the threshold, the result 1 is output, otherwise 0. Each threshold produces a different point in the receiver operating characteristic (ROC) plane. Conceptually, plotting the point corresponding to every threshold in ROC space generates a curve, the ROC curve, which essentially represents the trade-off between true positive and false positive rates at different decision thresholds. The area under the curve (AUC) can be used to evaluate whether a classifier effectively distinguishes positive and negative samples in a particular problem. Evaluating classifiers through an AUC algorithm therefore makes it possible to screen for a more suitable classifier.
As shown in "Research on AUC-based Classifier Performance Evaluation" (Jilin University, Jiang Shuai), existing methods suffer from narrow applicability, poor efficiency, complex processing, biased results, and obvious limitations.
Disclosure of Invention
To address the limitations of the prior art, the invention provides a classifier screening method, system, storage medium, and computer device. The technical scheme adopted by the invention is as follows:
a classifier screening method comprising the steps of:
preprocessing a data set, and converting the text of the data set into a vectorization matrix;
acquiring a variance expression of the AUC of the classifier under the corresponding classification problem according to the vectorization matrix;
reconstructing the variance expression by using a dynamic programming term to obtain an AUC expression based on dynamic programming;
and obtaining an AUC sample estimate for each candidate classifier according to the dynamic-programming-based AUC expression, and taking the candidate classifier with the largest AUC sample estimate as the optimal classifier under the corresponding classification problem.
Compared with the prior art, the method yields an unbiased estimate of the variance of the Mann-Whitney U statistic (MWUS); its time complexity is linear, far lower than that of conventional methods and on par with the most advanced rank-based methods. Its processing speed is likewise far higher than that of other comparable methods, so the most suitable classifier can be selected quickly. Beyond these advantages, the structure of the algorithm extends easily to three-class or multi-class settings, giving the method a wide range of application. In addition, the proposed method can also be used to improve current cell detection methods.
As a preferred solution, under the two-class classification problem, the dynamic-programming-based AUC expression is as follows:
wherein Â represents the AUC sample estimate; m represents the number of positive samples and n the number of negative samples of the data set; X_i represents the positive sample sequence and Y_j the negative sample sequence of the data set; ε(·) denotes the indicator function, which takes the value 1 when the statement in brackets is true and 0 when it is false; dynamic programming term S_1 covers the range ε(X = Y); dynamic programming term S_2 covers the range ε(X > Y).
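The formula images from the original filing do not survive in this text. Given the definitions of S_1 and S_2 above, the estimator presumably takes the standard Mann-Whitney form; the block below is a reconstruction under that assumption, not the patent's verbatim equation:

```latex
\hat{A} \;=\; \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}
  \left[\varepsilon\!\left(X_i > Y_j\right) + \tfrac{1}{2}\,\varepsilon\!\left(X_i = Y_j\right)\right]
  \;=\; \frac{S_2 + \tfrac{1}{2}\,S_1}{mn},
\qquad
S_1 = \sum_{i,j}\varepsilon\!\left(X_i = Y_j\right),\quad
S_2 = \sum_{i,j}\varepsilon\!\left(X_i > Y_j\right).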
As a preferred solution, under the two-class classification problem, the dynamic programming terms S_1, S_2, ..., S_9, S_10 cover the following ranges:
S_1: ε(X=Y); S_2: ε(X>Y); S_3: ε(X>Y>Y′) or ε(X>Y′>Y);
S_4: ε(X>Y=Y′); S_5: ε(X=Y′>Y) or ε(X=Y>Y′); S_6: ε(X=Y=Y′);
S_7: ε(X=X′>Y) or ε(X′=X>Y); S_8: ε(X=X′>Y);
S_9: ε(X′=X>Y) or ε(X>X′=Y); S_10: ε(X=X′=Y);
wherein X and Y respectively represent independent identically distributed sample sets drawn from two parent distributions, and X′ and Y′ are independent identically distributed copies of X and Y, respectively.
Further, the expression of the dynamic programming terms is as follows:
wherein m represents the number of positive samples and n the number of negative samples of the data set; X_i represents the positive sample sequence and Y_j the negative sample sequence; Z_k represents the combined sequence of X_i and Y_j arranged in non-descending order, with K = m + n and 1 ≤ k ≤ K; a_i represents the position of X_i within Z_k; c_i represents the position of Y_j within Z_k.
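As an illustrative sketch only (all function names are ours, not the patent's): the dynamic programming terms S_1 and S_2 can be computed directly from their definitions with a double loop, which yields the Mann-Whitney AUC sample estimate in O(mn) time. This is the quadratic baseline that the patent's linear reconstruction is designed to avoid.

```python
def auc_naive(pos, neg):
    """Mann-Whitney AUC sample estimate from positive scores `pos`
    and negative scores `neg`, straight from the definitions:
    S1 counts ties (X = Y), S2 counts wins (X > Y)."""
    m, n = len(pos), len(neg)
    s1 = sum(1 for x in pos for y in neg if x == y)  # epsilon(X = Y)
    s2 = sum(1 for x in pos for y in neg if x > y)   # epsilon(X > Y)
    return (s2 + 0.5 * s1) / (m * n)

# Four wins and one tie out of six pairs: (4 + 0.5) / 6 = 0.75
print(auc_naive([0.9, 0.8, 0.7], [0.4, 0.8]))
```

A perfect separator gives 1.0 and a fully reversed one gives 0.0, matching the usual reading of AUC.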
Further, under the two-class classification problem, the variance expression of the classifier AUC obtained from the vectorization matrix is as follows:
wherein E[·] represents the expectation of a random variable; m represents the number of positive samples and n the number of negative samples of the data set; Â represents the AUC sample estimate; ε(·) represents the indicator function.
Further, under the two-class classification problem, the variance expression reconstructed with the dynamic programming terms is as follows:
further, preprocessing the data set to convert the text of the data set into a vectorization matrix, including the following steps:
segmenting the text of the data set into a vocabulary, and converting the text of the data set into a vectorization matrix according to the vocabulary;
and automatically filling null values in the vectorization matrix, and carrying out Laplace smoothing on the vectorization matrix.
A classifier screening system comprising:
the preprocessing module is used for preprocessing the data set and converting the data of the data set into a vectorization matrix;
the variance expression acquisition module acquires a variance expression of the classifier AUC under the corresponding classification problem according to the vectorization matrix;
the AUC expression acquisition module based on dynamic programming acquires the AUC expression based on dynamic programming under the two classification problems by reconstructing the variance expression by using a dynamic programming term;
the classification module is used for obtaining an AUC sample estimate of each candidate classifier according to the dynamic-programming-based AUC expression under the two-class classification problem, and taking the candidate classifier with the largest AUC sample estimate as the optimal classifier under the corresponding classification problem.
The utility model also provides the following:
a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a classifier screening method as described above.
A computer device comprising a storage medium, a processor and a computer program stored in the storage medium and executable by the processor, which when executed by the processor performs the steps of a classifier screening method as described above.
Drawings
Fig. 1 is a flowchart of the classifier screening method provided in Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the classifier screening method provided in Embodiment 2 of the present invention;
Fig. 3 is a schematic diagram of the classifier screening system according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the present patent.
It should be understood that the described embodiments are merely some, but not all, embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art without creative effort based on the present invention shall fall within the scope of protection of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention as detailed in the appended claims. In the description of the present invention, the terms "first," "second," "third," and the like are used merely to distinguish between similar objects; they do not describe a particular order or sequence and should not be construed to indicate or imply relative importance. The specific meaning of these terms can be understood by those of ordinary skill in the art according to the specific circumstances.
Furthermore, in the description of the present invention, unless otherwise indicated, "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects. The invention is further illustrated in the following figures and embodiments.
In order to overcome the limitations of the prior art, the present embodiment provides a technical solution, which is further described below with reference to the drawings and embodiments.
Referring to fig. 1, a classifier screening method includes the following steps:
s01, preprocessing a data set, and converting the text of the data set into a vectorization matrix;
s02, obtaining a variance expression of the AUC of the classifier under the corresponding classification problem according to the vectorization matrix;
s03, reconstructing the variance expression by using a dynamic programming term to obtain an AUC expression based on dynamic programming;
s04, obtaining an AUC sample estimated value of the classifier to be selected according to the AUC expression based on dynamic programming, and taking the classifier to be selected with the maximum AUC sample estimated value as the optimal classifier under the corresponding classification problem.
Compared with the prior art, the method yields an unbiased estimate of the variance of the Mann-Whitney U statistic (MWUS); its time complexity is linear, far lower than that of conventional methods and on par with the most advanced rank-based methods. Its processing speed is likewise far higher than that of other comparable methods, so the most suitable classifier can be selected quickly. Beyond these advantages, the structure of the algorithm extends easily to three-class or multi-class settings, giving the method a wide range of application. In addition, the proposed method can also be used to improve current cell detection methods.
In particular, in an alternative embodiment, the text of the dataset may take the form of news text.
Classification problems are very common in practice, for example spam recognition, handwritten digit recognition, face recognition, and speech recognition. For a two-class problem, consider a microblog message data set containing many words and numbers: to filter out abusive speech, a simple classifier can be built that marks a message as offensive if it uses negative or insulting language. For a three-class problem, such as heart monitoring, measuring the heartbeat frequency signal yields three alternative diagnoses: normal heart rate, heart rate too fast, and heart rate too slow. The analysis then involves one more variable in the expression, equivalent to a third sample, and the AUC plane of the two-class case becomes a curved surface. For three-class or multi-class problems, the variance expression can likewise be reconstructed with dynamic programming terms to obtain consistent and predictable results based on the principle of this patent, so the details are omitted here.
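The patent leaves the multi-class construction to the same principle. Purely as an assumed illustration (the patent does not give this formula), one common way to extend AUC to ordered classes such as the heart-rate example is to average the pairwise two-class AUCs; all names below are ours:

```python
from itertools import combinations

def pairwise_auc(scores_hi, scores_lo):
    """Two-class AUC: probability that a sample of the higher class
    outscores a sample of the lower class (ties count one half)."""
    wins = sum(1 for x in scores_hi for y in scores_lo if x > y)
    ties = sum(1 for x in scores_hi for y in scores_lo if x == y)
    return (wins + 0.5 * ties) / (len(scores_hi) * len(scores_lo))

def multiclass_auc(groups):
    """groups: dict mapping an ordinal class label (e.g. too slow <
    normal < too fast) to the list of scores observed for that class."""
    pairs = list(combinations(sorted(groups), 2))
    return sum(pairwise_auc(groups[hi], groups[lo]) for lo, hi in pairs) / len(pairs)
```

With three perfectly separated classes every pairwise AUC is 1.0, so the average is 1.0; interleaved classes pull the average down toward 0.5.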
As a preferred solution, under the two-class classification problem, the dynamic-programming-based AUC expression is as follows:
wherein Â represents the AUC sample estimate; m represents the number of positive samples and n the number of negative samples of the data set; X_i represents the positive sample sequence and Y_j the negative sample sequence of the data set; ε(·) denotes the indicator function, which takes the value 1 when the statement in brackets is true and 0 when it is false; dynamic programming term S_1 covers the range ε(X = Y); dynamic programming term S_2 covers the range ε(X > Y).
As a preferred solution, under the two-class classification problem, the dynamic programming terms S_1, S_2, ..., S_9, S_10 cover the following ranges:
S_1: ε(X=Y); S_2: ε(X>Y); S_3: ε(X>Y>Y′) or ε(X>Y′>Y);
S_4: ε(X>Y=Y′); S_5: ε(X=Y′>Y) or ε(X=Y>Y′); S_6: ε(X=Y=Y′);
S_7: ε(X=X′>Y) or ε(X′=X>Y); S_8: ε(X=X′>Y);
S_9: ε(X′=X>Y) or ε(X>X′=Y); S_10: ε(X=X′=Y);
wherein X and Y respectively represent independent identically distributed sample sets drawn from two parent distributions, and X′ and Y′ are independent identically distributed copies of X and Y, respectively.
In the present embodiment, Z_k represents the combined sequence of X_i and Y_j arranged in non-decreasing order, with K = m + n and 1 ≤ k ≤ K.
Two technical vectors are then readily obtained, from which the dynamic programming terms follow:
wherein i and j index the two independent samples, i = 1, ..., n and j = 1, ..., m; l represents the position of X_i and Y_j in the sorted sequence; c_l and t_l represent, respectively, the positions and counts of the data within the sorted sequences of the two independently drawn samples; m represents the number of positive samples and n the number of negative samples of the data set; X_i represents the positive sample sequence and Y_j the negative sample sequence; a_i represents the position of X_i within Z_k; c_i represents the position of Y_j within Z_k.
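The non-descending merged sequence Z_k is what makes a fast estimate possible: after one sort, tied groups can be mid-ranked in a single pass and the AUC recovered from the rank sum. The sketch below is our own illustration of that idea (not the patent's verbatim algorithm), using the standard Mann-Whitney rank identity U = R_pos - m(m+1)/2:

```python
def auc_by_ranks(pos, neg):
    """AUC sample estimate via the rank-sum identity. One sort of the
    merged sequence Z (non-decreasing), one pass assigning mid-ranks
    to tied groups, then U = R_pos - m(m+1)/2 and AUC = U / (m*n)."""
    m, n = len(pos), len(neg)
    z = sorted([(v, 0) for v in neg] + [(v, 1) for v in pos])
    rank_sum_pos = 0.0
    k, K = 0, m + n
    while k < K:
        j = k
        while j + 1 < K and z[j + 1][0] == z[k][0]:
            j += 1                              # extend the tie group
        midrank = (k + j) / 2 + 1               # shared 1-based mid-rank
        rank_sum_pos += midrank * sum(label for _, label in z[k:j + 1])
        k = j + 1
    return (rank_sum_pos - m * (m + 1) / 2) / (m * n)

# Same data as the quadratic double loop would give: 0.75
print(auc_by_ranks([0.9, 0.8, 0.7], [0.4, 0.8]))
```

After the O((m+n) log(m+n)) sort, the pass over Z is linear, which is the complexity regime the patent claims for its reconstruction.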
Further, under the two-class classification problem, the variance expression of the classifier AUC obtained from the vectorization matrix is as follows:
wherein E[·] represents the expectation of a random variable; m represents the number of positive samples and n the number of negative samples of the data set; Â represents the AUC sample estimate; ε(·) represents the indicator function.
Specifically, the above variance expression can be converted into a compact form of the variance:
The resulting expression, although unbiased, is of cubic order in m and n, which is very inefficient for large m and n; after the dynamic programming terms are introduced, the variance expression is further converted into:
In particular, ε(·) denotes the indicator function, which takes the value 1 when the statement in brackets is true and 0 when it is false. For the dynamic-programming-based AUC expression, it holds that:
the method of obtaining the AUC from the dynamic-programming-based AUC expression is unbiased.
Further, preprocessing the data set to convert the text of the data set into a vectorization matrix, including the following steps:
s011, segmenting the text of the data set into a vocabulary, and converting the text of the data set into a vectorization matrix according to the vocabulary;
s012, automatically filling null values in the vectorization matrix, and carrying out Laplacian smoothing on the vectorization matrix.
Specifically, for English text, a split function can be used to segment the text, treating non-letter and non-digit characters as delimiters; for Chinese sentences, the jieba segmentation library is used. Laplace smoothing of the processed vocabulary solves the zero-probability problem, and working in logarithmic form avoids errors caused by underflow or floating-point rounding. Using the natural logarithm for this incurs no loss.
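A minimal sketch of this preprocessing path, assuming a plain bag-of-words representation (the patent names a split-style function for English and jieba for Chinese; only the English path is shown, and all function names are ours):

```python
import math
import re

def tokenize(text):
    # split on any non-letter, non-digit characters (English path only;
    # the patent uses jieba for Chinese sentences)
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def vectorize(docs):
    """Build a vocabulary from the documents and a count matrix over it."""
    vocab = sorted({tok for d in docs for tok in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]   # null entries filled with 0
    for row, d in enumerate(docs):
        for tok in tokenize(d):
            matrix[row][index[tok]] += 1
    return vocab, matrix

def laplace_log_probs(counts, vocab_size):
    """Add-one (Laplace) smoothing, taken in natural-log form to avoid
    zero probabilities and floating-point underflow."""
    total = sum(counts) + vocab_size
    return [math.log((c + 1) / total) for c in counts]

vocab, mat = vectorize(["spam offer now", "meeting at noon", "offer now now"])
```

Each row of `mat` is one document's count vector over the shared vocabulary; the smoothed log-probabilities are what a downstream (e.g. naive Bayes) classifier would consume.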
A classifier screening system comprising:
the preprocessing module 1 is used for preprocessing the data set and converting the data of the data set into a vectorization matrix;
the variance expression acquisition module 2 acquires a variance expression of the classifier AUC under the corresponding classification problem according to the vectorization matrix;
the AUC expression acquisition module 3 based on dynamic programming acquires the AUC expression based on dynamic programming under the two classification problems by reconstructing the variance expression by using a dynamic programming term;
and the classification module 4 is used for obtaining an AUC sample estimate of each candidate classifier according to the dynamic-programming-based AUC expression under the two-class classification problem, and taking the candidate classifier with the largest AUC sample estimate as the optimal classifier under the corresponding classification problem.
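The modules above reduce to a simple decision rule: score each candidate classifier, estimate its AUC, and keep the argmax. The sketch below is our illustration of that rule, with illustrative names throughout:

```python
def auc(pos_scores, neg_scores):
    """Mann-Whitney AUC sample estimate (ties count one half)."""
    m, n = len(pos_scores), len(neg_scores)
    wins = sum(1 for x in pos_scores for y in neg_scores if x > y)
    ties = sum(1 for x in pos_scores for y in neg_scores if x == y)
    return (wins + 0.5 * ties) / (m * n)

def screen(candidates, samples, labels):
    """candidates: dict name -> scoring function; labels are 0/1.
    Returns the candidate with the largest AUC sample estimate."""
    best_name, best_auc = None, -1.0
    for name, clf in candidates.items():
        scores = [clf(x) for x in samples]
        pos = [s for s, y in zip(scores, labels) if y == 1]
        neg = [s for s, y in zip(scores, labels) if y == 0]
        a = auc(pos, neg)
        if a > best_auc:
            best_name, best_auc = name, a
    return best_name, best_auc

# A scorer aligned with the labels beats its negation
best = screen({"identity": lambda x: x, "negated": lambda x: -x},
              [1, 2, 3, 4], [0, 0, 1, 1])
print(best)  # -> ('identity', 1.0)
```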
The utility model also provides the following:
a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a classifier screening method as described above.
A computer device comprising a storage medium, a processor and a computer program stored in the storage medium and executable by the processor, which when executed by the processor performs the steps of a classifier screening method as described above.
It is to be understood that the above examples of the present invention are provided by way of illustration only and do not limit its embodiments. Other variations or modifications will be apparent to those of ordinary skill in the art from the above description; it is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the protection scope of the claims.

Claims (8)

1. A classifier screening method, comprising the steps of:
preprocessing a data set, and converting the text of the data set into a vectorization matrix;
acquiring a variance expression of the AUC of the classifier under the corresponding classification problem according to the vectorization matrix;
reconstructing the variance expression by using a dynamic programming term to obtain an AUC expression based on dynamic programming;
obtaining an AUC sample estimate for each candidate classifier according to the dynamic-programming-based AUC expression, and taking the candidate classifier with the largest AUC sample estimate as the optimal classifier under the corresponding classification problem;
under the two-class classification problem, the dynamic programming terms S_1, S_2, ..., S_9, S_10 cover the following ranges:
S_1: ε(X=Y); S_2: ε(X>Y); S_3: ε(X>Y>Y′) or ε(X>Y′>Y);
S_4: ε(X>Y=Y′); S_5: ε(X=Y′>Y) or ε(X=Y>Y′); S_6: ε(X=Y=Y′);
S_7: ε(X=X′>Y) or ε(X′=X>Y); S_8: ε(X=X′>Y);
S_9: ε(X′=X>Y) or ε(X>X′=Y); S_10: ε(X=X′=Y);
wherein X and Y respectively represent independent identically distributed sample sets drawn from two parent distributions, and X′ and Y′ are independent identically distributed copies of X and Y, respectively;
the expression of the dynamic programming terms is as follows:
wherein i and j index the two independent samples, i = 1, ..., n and j = 1, ..., m; l represents the position of X_i and Y_j in the sorted sequence; c_l and t_l represent, respectively, the positions and counts of the data within the sorted sequences of the two independently drawn samples; ε(·) denotes the indicator function, which takes the value 1 when the statement in brackets is true and 0 when it is false; m represents the number of positive samples and n the number of negative samples of the data set; X_i represents the positive sample sequence and Y_j the negative sample sequence; Z_k represents the combined sequence of X_i and Y_j arranged in non-descending order, with K = m + n and 1 ≤ k ≤ K; a_i represents the position of X_i within Z_k; c_i represents the position of Y_j within Z_k.
2. The classifier screening method according to claim 1, wherein under the two-class classification problem, the dynamic-programming-based AUC expression is as follows:
wherein Â represents the AUC sample estimate; m represents the number of positive samples and n the number of negative samples of the data set; X_i represents the positive sample sequence and Y_j the negative sample sequence; ε(·) denotes the indicator function, which takes the value 1 when the statement in brackets is true and 0 when it is false; dynamic programming term S_1 covers the range ε(X=Y); dynamic programming term S_2 covers the range ε(X>Y).
3. The classifier screening method according to claim 1, wherein under the two-class classification problem, the variance expression of the classifier AUC obtained from the vectorization matrix is as follows:
wherein E[·] represents the expectation of a random variable; m represents the number of positive samples and n the number of negative samples of the data set; Â represents the AUC sample estimate; ε(·) represents the indicator function.
4. The classifier screening method according to claim 1, wherein under the two-class classification problem, the variance expression reconstructed with the dynamic programming terms is as follows:
5. the classifier screening method of claim 1, wherein preprocessing the data set to convert text of the data set into a vectorization matrix comprises the steps of:
segmenting the text of the data set into a vocabulary, and converting the text of the data set into a vectorization matrix according to the vocabulary;
and automatically filling null values in the vectorization matrix, and carrying out Laplace smoothing on the vectorization matrix.
6. A classifier screening system, characterized by being applied to the classifier screening method of claim 1, comprising:
the preprocessing module is used for preprocessing the data set and converting the data of the data set into a vectorization matrix;
the variance expression acquisition module acquires a variance expression of the classifier AUC under the corresponding classification problem according to the vectorization matrix;
the AUC expression acquisition module based on dynamic programming acquires the AUC expression based on dynamic programming under the two classification problems by reconstructing the variance expression by using a dynamic programming term;
and the classification module is used for obtaining an AUC sample estimate of each candidate classifier according to the dynamic-programming-based AUC expression under the two-class classification problem, and taking the candidate classifier with the largest AUC sample estimate as the optimal classifier under the corresponding classification problem.
7. A storage medium having a computer program stored thereon, characterized by: the computer program, when executed by a processor, implements the steps of the classifier screening method of any one of claims 1 to 5.
8. A computer device, characterized by: comprising a storage medium, a processor and a computer program stored in the storage medium and executable by the processor, which computer program, when executed by the processor, implements the steps of the classifier screening method of any one of claims 1 to 5.
CN202010722712.XA 2020-07-24 2020-07-24 Classifier screening method, system, storage medium and computer equipment Active CN111931829B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010722712.XA CN111931829B (en) 2020-07-24 2020-07-24 Classifier screening method, system, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010722712.XA CN111931829B (en) 2020-07-24 2020-07-24 Classifier screening method, system, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN111931829A CN111931829A (en) 2020-11-13
CN111931829B true CN111931829B (en) 2023-09-01

Family

ID=73315453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010722712.XA Active CN111931829B (en) 2020-07-24 2020-07-24 Classifier screening method, system, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111931829B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114330468A (en) * 2021-07-14 2022-04-12 广东工业大学 Classifier screening method and system based on dynamic programming and computer equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129568A (en) * 2011-04-29 2011-07-20 南京邮电大学 Method for detecting image-based spam email using an improved Gaussian mixture model classifier
CN102631194A (en) * 2012-04-13 2012-08-15 西南大学 Tabu search method for electrocardiographic (ECG) feature selection
CN102955946A (en) * 2011-08-18 2013-03-06 刘军 Two-stage fast classifier based on linear classification tree and neural network
CN106528771A (en) * 2016-11-07 2017-03-22 中山大学 Fast structural SVM text classification optimization algorithm
CN106845156A (en) * 2017-01-11 2017-06-13 张渠 Classification method, apparatus and system based on platelet differentially expressed gene markers
CN110399490A (en) * 2019-07-17 2019-11-01 武汉斗鱼网络科技有限公司 Bullet-screen (barrage) text classification method, device, equipment and storage medium


Non-Patent Citations (1)

Title
Human action recognition based on joint information and extreme learning machine; Zhang Sunpei; Sun Huaijiang; Modern Electronics Technique (Issue 10), pp. 1-4 *

Also Published As

Publication number Publication date
CN111931829A (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN107256245B (en) Offline model improvement and selection method for spam message classification
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN109065028B (en) Speaker clustering method, speaker clustering device, computer equipment and storage medium
CN110633725B (en) Method and device for training classification model and classification method and device
CN108694346B (en) Ship radiation noise signal identification method based on two-stage CNN
JP2003526142A (en) Unsupervised adaptation and classification of multi-source data using generalized Gaussian mixture model
JP2005202932A (en) Method of classifying data into a plurality of classes
CN113850281B (en) MEANSHIFT optimization-based data processing method and device
CN107729520B (en) File classification method and device, computer equipment and computer readable medium
JP2012042990A (en) Image identification information adding program and image identification information adding apparatus
CN111046183A (en) Method and device for constructing neural network model for text classification
CN112437053B (en) Intrusion detection method and device
CN108681532B (en) Sentiment analysis method for Chinese microblog
CN111125469A (en) User clustering method and device for social network and computer equipment
CN111079427A (en) Junk mail identification method and system
CN111931829B (en) Classifier screening method, system, storage medium and computer equipment
CN104077598A (en) Emotion recognition method based on speech fuzzy clustering
Liu et al. A method of plant classification based on wavelet transforms and support vector machines
CN111401440B (en) Target classification recognition method and device, computer equipment and storage medium
CN116467141A (en) Log recognition model training, log clustering method, related system and equipment
CN107563287B (en) Face recognition method and device
WO2021017736A1 (en) Image analysis apparatus
CN112906804A (en) Hash sample balance cancer labeling method for histopathology image
CN114463574A (en) Scene classification method and device for remote sensing image
CN107341485B (en) Face recognition method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant