WO2020150955A1 - Data classification method and apparatus, and device and storage medium - Google Patents

Data classification method and apparatus, and device and storage medium Download PDF

Info

Publication number
WO2020150955A1
Authority
WO
WIPO (PCT)
Prior art keywords
value attribute
continuous value
data
continuous
attribute
Prior art date
Application number
PCT/CN2019/072932
Other languages
French (fr)
Chinese (zh)
Inventor
何玉林
Original Assignee
深圳大学
Application filed by 深圳大学 filed Critical 深圳大学
Priority to PCT/CN2019/072932 priority Critical patent/WO2020150955A1/en
Publication of WO2020150955A1 publication Critical patent/WO2020150955A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition

Definitions

  • the present invention relates to the field of data processing technology, and in particular to a data classification method, device, equipment and storage medium.
  • the operating data are mostly mixed-value attributes, which include continuous value attributes and discrete value attributes.
  • a common classification method is to convert discrete-valued attributes into continuous ones, and then classify the resulting continuous-valued attributes.
  • the attribute values after the one-hot encoding operation are still discrete in the sense of their numerical distribution, so the operation does not fundamentally make the discrete value attribute continuous.
  • the present invention provides a data classification method, device, equipment and storage medium to solve the problem that existing classification methods, which rely on one-hot encoding, do not truly make discrete value attributes continuous.
  • the present invention provides a data classification method, which includes: performing continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; training the second continuous value attribute with a neural network, and using the data of the Ƒ-th hidden layer as a third continuous value attribute, wherein the neural network includes Ƒ hidden layers; merging the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and classifying the fourth continuous value attribute to obtain classified data.
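The four steps above (continuous encoding, hidden-layer extraction, merging, classification) can be sketched end to end. The following is a minimal, hypothetical illustration in plain Python: the helper names and the tiny fixed "network" weights are invented stand-ins for the patented encoding neural network (ENN), not its actual implementation.

```python
import math

# Hypothetical sketch of the four-step flow; the fixed weights below stand in
# for a trained encoding neural network (ENN) and are purely illustrative.

def one_hot(value, values):
    """Step 1: continuous encoding of a discrete value attribute."""
    return [1.0 if v == value else 0.0 for v in values]

def hidden_layer(vec, weights):
    """Step 2: sigmoid hidden-layer outputs serve as the third attribute."""
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, vec))))
            for row in weights]

def classify(first_continuous, discrete_value, values, weights, rule):
    second = one_hot(discrete_value, values)   # second continuous attribute
    third = hidden_layer(second, weights)      # third continuous attribute
    fourth = first_continuous + third          # step 3: merge
    return rule(fourth)                        # step 4: classify

weights = [[0.5, -0.2, 0.1, 0.3],
           [0.1, 0.4, -0.3, 0.2]]
label = classify([0.7], "B2", ["B1", "B2", "B3", "B4"], weights,
                 rule=lambda x: int(sum(x) > 1.5))
```

Here the final `rule` is a toy threshold; the patent itself leaves the downstream classifier open (support vector machines, neural networks, and the like).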
  • the discrete-valued attributes are first continuously encoded, and the neural network is then used to train the second continuous-valued attribute, thereby thoroughly transforming the discrete-valued attributes into real-valued continuous value attributes that carry order information.
  • before training the second continuous value attribute with the neural network and using the data of the Ƒ-th hidden layer as the third continuous value attribute, the method further includes: constructing an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and training the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  • the sum of the error value of the third continuous value attribute and the substitution entropy is used as the objective function to train the neural network; training on the second continuous value attribute therefore not only ensures the minimum error between the actual output and the theoretical output, but also ensures the minimum uncertainty of the converted data set.
  • constructing the objective function specifically includes: subtracting the third continuous value attribute from its theoretical value to obtain an error value; dividing the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; obtaining the substitution entropy of each first sub-data set; and superimposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • the third continuous value attribute is divided into first sub-data sets, and the substitution entropy of the first data set is obtained from the substitution entropies of the first sub-data sets, which reduces computational complexity.
  • obtaining the substitution entropy of the first sub-data set specifically includes:
  • the substitution entropy of the first sub-data set is obtained according to the first formula, where En[·] represents the substitution entropy, b q represents the window width of the kernel density estimation method, and the remaining symbols represent the first sub-data, the number of samples of the data, and the n-th and m-th elements of the first sub-data, respectively.
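The first formula itself is rendered as an image in the original filing and is not reproduced in this text. Assuming the standard resubstitution (plug-in) entropy estimator with a Gaussian kernel of window width b_q — an interpretation, not the verified patent formula — it would take the form:

```latex
% Hypothetical reconstruction (assumes a Gaussian kernel): substitution
% entropy of a sub-data set Y_q = \{y^q_1, \dots, y^q_N\} with width b_q.
\mathrm{En}[Y_q] = -\frac{1}{N}\sum_{n=1}^{N}
  \log\!\left(\frac{1}{N b_q \sqrt{2\pi}}\sum_{m=1}^{N}
  \exp\!\left(-\frac{\left(y^{q}_{n}-y^{q}_{m}\right)^{2}}{2 b_q^{2}}\right)\right)
```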
  • performing superposition processing on the substitution entropy of a plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute specifically includes:
  • the substitution entropy of the third continuous value attribute is obtained according to the second formula, in which one symbol denotes the number of nodes in the Ƒ-th hidden layer and another denotes the third continuous value attribute.
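The second formula is likewise an image in the original. Since the sub-data-set entropies are "superimposed", the natural reading — offered as an assumption — is a plain sum over the nodes of the Ƒ-th hidden layer:

```latex
% Hypothetical reconstruction: K is the number of nodes in the \mathcal{F}-th
% hidden layer; Y is the third continuous value attribute, Y_q its q-th
% sub-data set.
U[Y] = \sum_{q=1}^{K} \mathrm{En}[Y_q]
```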
  • the data classification device is introduced below; its implementation principle and technical effects are similar to those of the above method and are not repeated here.
  • the present invention provides a data classification device, including: an obtaining module, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; a training module, configured to train the second continuous value attribute with a neural network and use the data of the Ƒ-th hidden layer as a third continuous value attribute, wherein the neural network includes Ƒ hidden layers; the obtaining module being further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute, and to classify the fourth continuous value attribute to obtain classified data.
  • the device further includes: a construction module, configured to construct an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and a training module, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  • the construction module specifically includes: a subtraction module, configured to subtract the third continuous value attribute from its theoretical value to obtain an error value; a division module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and a superposition module, configured to superimpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • the construction module specifically includes:
  • the substitution entropy of the first sub-data set is obtained according to the first formula, where En[·] represents the substitution entropy, b q represents the window width of the kernel density estimation method, and the remaining symbols represent the first sub-data, the number of samples of the data, and the n-th and m-th elements of the first sub-data, respectively.
  • the construction module specifically includes:
  • the substitution entropy of the third continuous value attribute is obtained according to the second formula, in which one symbol denotes the number of nodes in the Ƒ-th hidden layer and another denotes the third continuous value attribute.
  • the present invention provides an electronic device, including: at least one processor and a memory, where the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the data classification method of the first aspect and its optional solutions.
  • the present invention provides a computer-readable storage medium storing computer-executable instructions; when a processor executes the computer-executable instructions, the data classification method of the first aspect and its optional solutions is implemented.
  • the present invention provides a data classification method, device, equipment and storage medium.
  • a discrete value attribute is continuously encoded to obtain a second continuous value attribute; a neural network is used to train the second continuous value attribute, and the data of the Ƒ-th hidden layer is used as a third continuous value attribute, thereby completely transforming discrete value attributes into real-valued continuous value attributes with order information.
  • the classification process is performed to obtain the classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using only one-hot encoding.
  • Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention
  • Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention
  • Fig. 3 is a schematic diagram showing the structure of a data classification device according to an exemplary embodiment of the present invention.
  • Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
  • the present invention provides a data classification method, device, equipment and storage medium to solve the problem that existing classification methods, which rely on one-hot encoding, do not truly make discrete value attributes continuous.
  • Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Figure 1, the data classification method provided in this embodiment includes:
  • S101 The data includes a discrete value attribute and a first continuous value attribute. The discrete value attribute is continuously encoded to obtain the second continuous value attribute, which realizes a preliminary conversion of the discrete value attribute into a continuous value attribute.
  • one-hot encoding can be used to convert the discrete value attribute into the second continuous value attribute.
  • the data set is divided into continuous value attributes and discrete value attributes. Two symbols respectively represent the numbers of continuous value attributes and discrete value attributes that the data set contains, and another symbol represents the number of samples in the data set. For a discrete value attribute of the data set, a further symbol represents the number of values it takes; the category of the n-th sample is likewise denoted, on the assumption that the data set contains a given number of categories.
  • the neural network includes Ƒ hidden layers.
  • S102 The second continuous value attribute is input into the neural network for training, and the data of the Ƒ-th hidden layer is output as the third continuous value attribute.
  • an encoding neural network (ENN) is constructed, which takes the one-hot encoded data set shown in Table 3 as input.
  • the input of ENN is expressed by formula (2).
  • the number of input layer nodes of ENN is:
  • the number of output layer nodes of the ENN is:
  • each hidden layer node uses the sigmoid function to activate its input; the f-th hidden layer contains a given number of nodes and is expressed by formula (5).
  • Table 4 The third continuous value attribute data set
  • S103 Perform merging processing on the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.
  • the first continuous value attribute and the third continuous value attribute are combined to obtain a fourth continuous value attribute, where the fourth continuous value attribute includes the first continuous value attribute and the third continuous value attribute.
  • the third continuous value attribute is expressed as:
  • S104 Perform classification processing on the fourth continuous value attribute to obtain classified data.
  • any classification method for continuous-valued attribute data, such as support vector machines or neural networks, can be used to process the real-valued attribute data set.
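The patent leaves this downstream classifier open. As one hedged illustration, here is a minimal nearest-centroid rule over fourth-continuous-value-attribute vectors; the class names and data below are invented for the example:

```python
# Illustrative only: a nearest-centroid classifier standing in for the
# unspecified downstream classifier (e.g., an SVM or neural network).

def centroid(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_centroid(sample, centroids):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(sample, centroids[lbl]))

# Fourth continuous value attributes: original continuous values merged with
# hidden-layer outputs (all real-valued), grouped by a known class label.
train = {
    "normal": [[0.9, 0.8, 0.7], [1.0, 0.7, 0.8]],
    "faulty": [[0.1, 0.2, 0.3], [0.2, 0.1, 0.2]],
}
centroids = {lbl: centroid(rows) for lbl, rows in train.items()}
label = nearest_centroid([0.85, 0.75, 0.8], centroids)
# label == "normal"
```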
  • the discrete value attribute is continuously encoded to obtain the second continuous value attribute; the neural network is used to train the second continuous value attribute, and the data of the Ƒ-th hidden layer is used as the third continuous value attribute, thereby completely transforming discrete value attributes into real-valued continuous value attributes with order information.
  • the classification process is performed to obtain the classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using only one-hot encoding.
  • Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Figure 2, the data classification method provided in this embodiment includes:
  • S201 Perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute.
  • S202 Construct an objective function, and use the second continuous value attribute to train the neural network until the value of the objective function is the minimum value.
  • the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy.
  • E[·] is the training error of the ENN corresponding to the third continuous value attribute data set.
  • U[·] is the uncertainty of the data of the Ƒ-th hidden layer.
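Reading E[·] and U[·] together, the objective minimized in S202 can be written compactly; the notation J(θ) for the combined objective over network parameters θ is ours, not the patent's:

```latex
% Combined objective: ENN training error plus the uncertainty
% (substitution entropy) of the \mathcal{F}-th hidden layer's data.
\min_{\theta}\; J(\theta) = E[\cdot] + U[\cdot]
```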
  • the error value can be obtained by subtracting the third continuous value attribute from its theoretical value.
  • S301 Perform data set division on the third continuous value attribute to obtain a first sub-data set.
  • the first data set includes a plurality of first sub-data sets.
  • the third continuous value attribute data set is expressed as:
  • the first sub-data set is expressed as:
  • substitution entropy calculation method of the first sub-data set is as follows:
  • one symbol is the entropy corresponding to the data set, and another represents the probability density function of the data set obtained by the kernel density estimation method.
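As a numeric illustration of entropy via kernel density estimation, the sketch below uses the generic resubstitution estimator with a Gaussian kernel — a common construction, not necessarily the patent's exact formula:

```python
import math

def substitution_entropy(samples, b):
    """Resubstitution entropy estimate: average negative log of the
    Gaussian kernel-density estimate evaluated at each sample point."""
    n = len(samples)
    total = 0.0
    for x in samples:
        kernel_sum = sum(math.exp(-((x - y) ** 2) / (2 * b * b))
                         for y in samples)
        density = kernel_sum / (n * b * math.sqrt(2 * math.pi))
        total += -math.log(density)
    return total / n

tight = substitution_entropy([0.50, 0.51, 0.49, 0.50], b=0.1)
spread = substitution_entropy([0.1, 0.9, 0.3, 0.7], b=0.1)
# A tightly clustered sample has lower estimated entropy (uncertainty)
# than a spread-out one, which is what the objective's U term penalizes.
```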
  • b q represents the window width parameter of the kernel density estimation method.
  • b q is a function of the number of samples.
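The text only states that b q depends on the number of samples. A common window-width choice in kernel density estimation — given here as general background, not as the patent's definition — is Silverman's rule of thumb:

```latex
% Silverman's rule of thumb for a Gaussian kernel: the width shrinks with
% the sample count N; \hat{\sigma} is the sample standard deviation.
b \approx 1.06\,\hat{\sigma}\,N^{-1/5}
```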
  • S302 Perform superposition processing on the substitution entropy of the multiple first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • substitution entropy U[ ⁇ ] of the third continuous value attribute is calculated as follows:
  • S204 Perform merging processing on the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.
  • S205 Perform classification processing on the fourth continuous value attribute to obtain classified data.
  • Fig. 3 is a schematic diagram showing the structure of a data classification device according to an exemplary embodiment of the present invention.
  • this embodiment provides a data classification device, including: an obtaining module 101, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, where the data includes the discrete value attribute and a first continuous value attribute; a module 102, configured to train the second continuous value attribute with a neural network and use the data of the Ƒ-th hidden layer as a third continuous value attribute, where the neural network includes Ƒ hidden layers; the obtaining module 101 being further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute, and to classify the fourth continuous value attribute to obtain classified data.
  • the device further includes: a construction module 103, configured to construct an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and a training module 104, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  • the construction module 103 specifically includes: a subtraction module, configured to subtract the third continuous value attribute from its theoretical value to obtain an error value; a division module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and a superposition module, configured to superimpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • the construction module 103 specifically includes:
  • the substitution entropy of the first sub-data set is obtained according to the first formula, where En[·] represents the substitution entropy, b q represents the window width of the kernel density estimation method, and the remaining symbols represent the first sub-data, the number of samples of the data, and the n-th and m-th elements of the first sub-data, respectively.
  • the construction module 103 specifically includes:
  • the substitution entropy of the third continuous value attribute is obtained according to the second formula, in which one symbol denotes the number of nodes in the Ƒ-th hidden layer and another denotes the third continuous value attribute.
  • the data classification device provided by this application can be used to implement the above data classification method, and its content and effects can be referred to the method part, which will not be repeated in this application.
  • Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
  • the electronic device 200 of this embodiment includes a processor 201 and a memory 202, where:
  • the memory 202 is used to store computer execution instructions
  • the processor 201 is configured to execute the computer-executable instructions stored in the memory to implement the steps of the data classification method in the foregoing embodiments. For details, refer to the related description in the foregoing method embodiments.
  • the memory 202 may be independent or integrated with the processor 201.
  • the electronic device 200 further includes a bus 203 for connecting the memory 202 and the processor 201.
  • the embodiment of the present invention also provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the processor executes the computer-executable instructions, the data classification method as described above is implemented.

Abstract

Disclosed are a data classification method and apparatus, and a device and a storage medium. The method comprises: continuously encoding a discrete value attribute to obtain a second continuous value attribute, wherein the data comprises the discrete value attribute and a first continuous value attribute; training the second continuous value attribute by using a neural network, and using the data of the Ƒ-th hidden layer as a third continuous value attribute, wherein the neural network comprises Ƒ hidden layers; combining the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and classifying the fourth continuous value attribute to obtain classified data. In the present invention, the discrete value attribute is first continuously encoded, and the second continuous value attribute is then trained by using a neural network, thereby completely transforming the discrete value attribute into a real-valued continuous value attribute that carries order information.

Description

Data classification method, device, equipment and storage medium

Technical field
The present invention relates to the field of data processing technology, and in particular to a data classification method, device, equipment and storage medium.
Background art
In industrial scenarios, in order to ensure that industrial equipment works normally, various data of the industrial equipment must be collected in real time to obtain operating data; the operating data is then classified, and the operating status of the equipment is evaluated based on the classified operating data. The operating data mostly has mixed-value attributes, which include continuous value attributes and discrete value attributes.
In the prior art, a common classification method is to convert discrete-valued attributes into continuous ones, and then classify the continuous-valued attributes. One-hot encoding is usually used to encode discrete-valued attributes as continuous attributes. For example, for a discrete-valued attribute B = {B1, B2, B3, B4} with four symbolic values, when a sample takes the value B1, B2, B3, or B4 on attribute B, the one-hot encoded attribute values are represented as (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1), respectively.
However, the attribute values after the one-hot encoding operation are still discrete in the sense of their numerical distribution, so the operation does not fundamentally make the discrete value attribute continuous.
Summary of the invention
The present invention provides a data classification method, device, equipment and storage medium to solve the problem that existing classification methods, which rely on one-hot encoding, do not truly make discrete value attributes continuous.
In a first aspect, the present invention provides a data classification method, which includes: performing continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; training the second continuous value attribute with a neural network, and using the data of the
Figure PCTCN2019072932-appb-000001
-th hidden layer as a third continuous value attribute, wherein the neural network includes
Figure PCTCN2019072932-appb-000002
hidden layers; merging the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and classifying the fourth continuous value attribute to obtain classified data.
In the data classification method provided by the present invention, the discrete value attribute is first continuously encoded, and the neural network is then used to train the second continuous value attribute, thereby thoroughly transforming the discrete value attribute into a real-valued continuous value attribute carrying order information.
Optionally, before training the second continuous value attribute with the neural network and using the data of the
Figure PCTCN2019072932-appb-000003
-th hidden layer as the third continuous value attribute, the method further includes: constructing an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and training the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
In the data classification method provided by the present invention, the sum of the error value of the third continuous value attribute and the substitution entropy is used as the objective function to train the neural network; thus, besides ensuring the minimum error between the actual output and the theoretical output, the minimum uncertainty of the converted data set is also ensured.
Optionally, constructing the objective function specifically includes: subtracting the third continuous value attribute from its theoretical value to obtain an error value; dividing the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; obtaining the substitution entropy of each first sub-data set; and superimposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
In the data classification method provided by the present invention, the third continuous value attribute is divided into first sub-data sets, and the substitution entropy of the first data set is obtained from the substitution entropies of the first sub-data sets, reducing computational complexity.
Optionally, obtaining the substitution entropy of the first sub-data set specifically includes:
The substitution entropy of the first sub-data set is obtained according to the first formula, where the first formula is:
Figure PCTCN2019072932-appb-000004
represents the first sub-data, En[·] represents the substitution entropy,
Figure PCTCN2019072932-appb-000005
represents the number of samples of the data,
Figure PCTCN2019072932-appb-000006
b q represents the window width of the kernel density estimation method, and
Figure PCTCN2019072932-appb-000007
respectively represent the n-th and m-th elements of the first sub-data.
Optionally, superimposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute specifically includes:
The substitution entropy of the third continuous value attribute is obtained according to the second formula, where the second formula is:
Figure PCTCN2019072932-appb-000008
where
Figure PCTCN2019072932-appb-000009
is the number of nodes in the
Figure PCTCN2019072932-appb-000010
-th hidden layer,
Figure PCTCN2019072932-appb-000011
is the third continuous value attribute, and
Figure PCTCN2019072932-appb-000012
The data classification device is introduced below; its implementation principle and technical effects are similar to those of the above method and are not repeated here.
In a second aspect, the present invention provides a data classification device, including: an obtaining module, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; a module configured to train the second continuous value attribute with a neural network and use the data of the
Figure PCTCN2019072932-appb-000013
-th hidden layer as a third continuous value attribute, wherein the neural network includes
Figure PCTCN2019072932-appb-000014
hidden layers; the obtaining module being configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute, and further configured to classify the fourth continuous value attribute to obtain classified data.
Optionally, the device further includes: a construction module, configured to construct an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and a training module, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
Optionally, the construction module specifically includes: a subtraction module, configured to subtract the third continuous value attribute from its theoretical value to obtain an error value; a division module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and a superposition module, configured to superimpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
Optionally, the construction module is specifically configured to: obtain the substitution entropy of the first sub-data set according to a first formula, where the first formula is:
Figure PCTCN2019072932-appb-000015
denotes the first sub-data, En[·] denotes the substitution entropy,
Figure PCTCN2019072932-appb-000016
denotes the number of samples of the data,
Figure PCTCN2019072932-appb-000017
b q denotes the window width of the kernel density estimation method, and
Figure PCTCN2019072932-appb-000018
denote the n-th and m-th elements of the first sub-data, respectively.
Optionally, the construction module is specifically configured to: obtain the substitution entropy of the third continuous value attribute according to a second formula, where the second formula is:
Figure PCTCN2019072932-appb-000019
where
Figure PCTCN2019072932-appb-000020
is the number of nodes contained in the
Figure PCTCN2019072932-appb-000021
-th hidden layer,
Figure PCTCN2019072932-appb-000022
is the third continuous value attribute, and
Figure PCTCN2019072932-appb-000023
The electronic device and the readable storage medium are introduced below; their implementation principles and technical effects are similar to those of the foregoing method and are not repeated here.

In a third aspect, the present invention provides an electronic device, including: at least one processor and a memory, where the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the data classification method according to the first aspect and its optional solutions.

In a fourth aspect, the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data classification method according to the first aspect and its optional solutions.
The present invention provides a data classification method, device, equipment and storage medium. In the data classification method, a discrete value attribute is continuously encoded to obtain a second continuous value attribute; a neural network is used to train the second continuous value attribute, and the data of the
Figure PCTCN2019072932-appb-000024
-th hidden layer is taken as a third continuous value attribute, thereby completely transforming the discrete value attribute into a continuous value attribute that carries order information and takes real values. The first continuous value attribute and the third continuous value attribute are merged and then classified to obtain classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using one-hot encoding alone.
Description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention;

Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a data classification device according to an exemplary embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
Detailed description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.

The present invention provides a data classification method, device, equipment and storage medium to solve the problem that the existing classification method, which uses a one-hot encoding operation, does not achieve true continuity of discrete value attributes.
Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Fig. 1, the data classification method provided in this embodiment includes:

S101. Perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute.

More specifically, the data includes a discrete value attribute and a first continuous value attribute. The discrete value attribute is continuously encoded to obtain the second continuous value attribute, realizing a preliminary conversion of the discrete value attribute into a continuous value attribute.

In this embodiment, one-hot encoding can be used to convert the discrete value attribute into the second continuous value attribute.
For example, suppose there is a mixed-value attribute data set as shown in Table 1 below:
Figure PCTCN2019072932-appb-000025
where the data set is divided into continuous value attributes and discrete value attributes;
Figure PCTCN2019072932-appb-000026
and
Figure PCTCN2019072932-appb-000027
respectively denote the numbers of continuous value and discrete value attributes contained in the data set
Figure PCTCN2019072932-appb-000028
;
Figure PCTCN2019072932-appb-000029
denotes the number of samples contained in the data set
Figure PCTCN2019072932-appb-000030
;
Figure PCTCN2019072932-appb-000031
denotes the
Figure PCTCN2019072932-appb-000032
-th continuous value attribute;
Figure PCTCN2019072932-appb-000033
denotes the
Figure PCTCN2019072932-appb-000034
-th discrete value attribute, whose values are assumed to be
Figure PCTCN2019072932-appb-000035
denotes the number of values taken by the discrete value attribute
Figure PCTCN2019072932-appb-000036
; then
Figure PCTCN2019072932-appb-000037
denotes the category of the n-th sample; assuming the data set
Figure PCTCN2019072932-appb-000038
has
Figure PCTCN2019072932-appb-000039
categories
Figure PCTCN2019072932-appb-000040
, then
Figure PCTCN2019072932-appb-000041
Table 1: Mixed-value attribute data set
Figure PCTCN2019072932-appb-000042
The data set composed of discrete value attributes shown in Table 2 below,
Figure PCTCN2019072932-appb-000043
, is one-hot encoded to obtain the one-hot encoded data set
Figure PCTCN2019072932-appb-000044
shown in Table 3 below.
Table 2: Discrete value attribute data set
Figure PCTCN2019072932-appb-000045
Figure PCTCN2019072932-appb-000046

Table 3: One-hot encoded data set
Figure PCTCN2019072932-appb-000047
Figure PCTCN2019072932-appb-000048
In Table 3, the following formula (1) is satisfied:
Figure PCTCN2019072932-appb-000049
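To make the one-hot step concrete, the following is a minimal sketch, not the patent's implementation; the attribute values and dimensions are invented for illustration. Each distinct discrete value maps to a 0/1 indicator vector whose components sum to 1, which is the property formula (1) expresses.

```python
import numpy as np

def one_hot_encode(column):
    """One-hot encode a single discrete value attribute: each distinct
    value maps to a 0/1 indicator vector whose components sum to 1."""
    values = sorted(set(column))
    index = {v: i for i, v in enumerate(values)}
    encoded = np.zeros((len(column), len(values)))
    for row, value in enumerate(column):
        encoded[row, index[value]] = 1.0
    return encoded

# Hypothetical discrete attribute with three categories.
colors = ["red", "green", "blue", "green"]
onehot = one_hot_encode(colors)
print(onehot.sum(axis=1))  # every row sums to 1
```

As noted in the description, the resulting columns are still discrete in the sense of numerical distribution, which is why the method feeds them into an encoding network next.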
S102. Use a neural network to train the second continuous value attribute, so as to take the data of the
Figure PCTCN2019072932-appb-000050
-th hidden layer as a third continuous value attribute.

More specifically, the neural network includes
Figure PCTCN2019072932-appb-000051
hidden layers. The second continuous value attribute is input into the neural network for training, and the data of the
Figure PCTCN2019072932-appb-000052
-th hidden layer of the neural network is output as the third continuous value attribute.
In this embodiment, a neural network containing
Figure PCTCN2019072932-appb-000053
hidden layers is constructed, which is called an Encoding Neural Network (ENN), where
Figure PCTCN2019072932-appb-000054
, and the one-hot encoded data set shown in Table 3 is used as its input.
The input of the ENN is expressed by formula (2):
Figure PCTCN2019072932-appb-000055
The number of input layer nodes of the ENN is:
Figure PCTCN2019072932-appb-000056
The output of the ENN is expressed by formula (4):
Figure PCTCN2019072932-appb-000057
The number of output layer nodes of the ENN is
Figure PCTCN2019072932-appb-000058
The hidden layer nodes use the Sigmoid function to activate their inputs. The f-th hidden layer contains
Figure PCTCN2019072932-appb-000059
nodes, where
Figure PCTCN2019072932-appb-000060
. The f-th hidden layer is expressed by formula (5):
Figure PCTCN2019072932-appb-000061
After the ENN is constructed, the neural network is used to train the second continuous value attribute, so that the data of the
Figure PCTCN2019072932-appb-000062
-th hidden layer is taken as the third continuous value attribute. The third continuous value attribute data set is shown in Table 4.

Table 4: Third continuous value attribute data set
Figure PCTCN2019072932-appb-000063
Figure PCTCN2019072932-appb-000064
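A minimal sketch of the encoding idea above follows. The layer sizes, the random (and here untrained) weights, and the choice of the last hidden layer are all assumptions made for illustration; in the method described, the ENN is trained first and the activations of the selected hidden layer become the third continuous value attribute.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_activations(x, weights, biases, layer):
    """Propagate one-hot inputs through sigmoid hidden layers and
    return the activations of the chosen hidden layer."""
    h = x
    activations = []
    for w, b in zip(weights, biases):
        h = sigmoid(h @ w + b)
        activations.append(h)
    return activations[layer]

rng = np.random.default_rng(0)
sizes = [5, 4, 3]                       # 5 one-hot inputs, two hidden layers
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = np.eye(5)[:4]                       # four one-hot samples
encoded = hidden_activations(x, weights, biases, layer=-1)
print(encoded.shape)                    # (4, 3): real-valued encoding
```

Because every activation is a real number in (0, 1), the encoded attribute carries order information, in contrast to the 0/1 one-hot columns.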
S103. Merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.

More specifically, the first continuous value attribute and the third continuous value attribute are merged to obtain the fourth continuous value attribute, where the fourth continuous value attribute includes both the first continuous value attribute and the third continuous value attribute.
In this embodiment, the third continuous value attribute is expressed as:
Figure PCTCN2019072932-appb-000065
The data corresponding to the first continuous value attribute in Table 1 and the data corresponding to the third continuous value attribute in Table 4 are merged to obtain a fourth continuous value data set
Figure PCTCN2019072932-appb-000066
whose attribute values are all real values, as shown in Table 5.
Table 5: Real-valued attribute data set
Figure PCTCN2019072932-appb-000067
Figure PCTCN2019072932-appb-000068
S104. Classify the fourth continuous value attribute to obtain classified data.

More specifically, any classification method for continuous value attribute data, such as a support vector machine or a neural network, can be used to process the real-valued attribute data set
Figure PCTCN2019072932-appb-000069
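Steps S103 and S104 can be sketched as follows. This is a hedged illustration only: the tiny arrays are invented, and a nearest-centroid rule stands in for the classifier, since the method allows any continuous-attribute classifier such as a support vector machine or a neural network.

```python
import numpy as np

def merge_attributes(continuous, encoded):
    """Concatenate the first continuous value attributes with the
    hidden-layer encoding to form the fourth continuous value attribute."""
    return np.hstack([continuous, encoded])

def nearest_centroid_fit(x, y):
    labels = sorted(set(y))
    return {c: x[np.array(y) == c].mean(axis=0) for c in labels}

def nearest_centroid_predict(x, centroids):
    labels = list(centroids)
    d = np.stack([np.linalg.norm(x - centroids[c], axis=1) for c in labels])
    return [labels[i] for i in d.argmin(axis=0)]

cont = np.array([[0.2], [0.8], [0.1], [0.9]])                    # first attribute
enc = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]])  # third attribute
merged = merge_attributes(cont, enc)                              # fourth attribute
centroids = nearest_centroid_fit(merged, [0, 1, 0, 1])
print(nearest_centroid_predict(merged, centroids))                # [0, 1, 0, 1]
```

The point of the merge is that every column of `merged` is real-valued, so a single continuous-attribute classifier can consume the whole data set.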
In the data classification method provided in this embodiment, the discrete value attribute is continuously encoded to obtain the second continuous value attribute; the neural network is used to train the second continuous value attribute, and the data of the
Figure PCTCN2019072932-appb-000070
-th hidden layer is taken as the third continuous value attribute, thereby completely transforming the discrete value attribute into a continuous value attribute that carries order information and takes real values. The first continuous value attribute and the third continuous value attribute are merged and then classified to obtain the classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using one-hot encoding alone.
Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Fig. 2, the data classification method provided in this embodiment includes:

S201. Perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute.

S202. Construct an objective function, and use the second continuous value attribute to train the neural network until the value of the objective function reaches its minimum.

More specifically, the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute.
The loss function L of the ENN is expressed by formula (6):
Figure PCTCN2019072932-appb-000071
where E[·] is the error between the actual output and the theoretical output of the ENN corresponding to the third continuous value attribute data set
Figure PCTCN2019072932-appb-000072
, and U[·] is the uncertainty of the data of the
Figure PCTCN2019072932-appb-000073
-th hidden layer
Figure PCTCN2019072932-appb-000074
The error value can be obtained by subtracting the theoretical value of the third continuous value attribute from the third continuous value attribute.

The substitution entropy of the third continuous value attribute is calculated in the following steps:

S301. Divide the third continuous value attribute into data sets to obtain first sub-data sets.

More specifically, the first data set includes a plurality of first sub-data sets.
Here, the third continuous value attribute data set is expressed as:
Figure PCTCN2019072932-appb-000075
The first sub-data set is expressed as:
Figure PCTCN2019072932-appb-000076
S302. Obtain the substitution entropy of the first sub-data set.

More specifically, the substitution entropy of the first sub-data set is calculated as follows:
Figure PCTCN2019072932-appb-000077
where
Figure PCTCN2019072932-appb-000078
is the substitution entropy corresponding to the data set
Figure PCTCN2019072932-appb-000079
, and
Figure PCTCN2019072932-appb-000080
denotes the probability density function, obtained by the kernel density estimation method on the data set
Figure PCTCN2019072932-appb-000081
, of the data set
Figure PCTCN2019072932-appb-000082
Figure PCTCN2019072932-appb-000083
is calculated as follows:
Figure PCTCN2019072932-appb-000084
where b q denotes the window width parameter of the kernel density estimation method, b q > 0, and b q is a function of the number of samples
Figure PCTCN2019072932-appb-000085
that satisfies the following conditions:
Figure PCTCN2019072932-appb-000086
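Because the formula images are not reproduced here, the following sketch shows only the general form of such a kernel-density (resubstitution) entropy estimate; the Gaussian kernel and the specific bandwidths are assumptions for illustration, not the patent's exact choices.

```python
import numpy as np

def substitution_entropy(z, bandwidth):
    """Resubstitution entropy estimate of a 1-D sample: estimate the
    density at every sample point with a Gaussian kernel of window
    width `bandwidth`, then average the negative log-densities."""
    z = np.asarray(z, dtype=float)
    n = z.size
    diffs = (z[:, None] - z[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    density = kernel.sum(axis=1) / (n * bandwidth)
    return -np.mean(np.log(density))

rng = np.random.default_rng(0)
narrow = substitution_entropy(rng.normal(0.0, 0.1, 200), bandwidth=0.05)
wide = substitution_entropy(rng.normal(0.0, 1.0, 200), bandwidth=0.5)
print(narrow < wide)  # a more concentrated sample has lower estimated entropy
```

This matches the intuition behind U[·] in formula (6): the more concentrated (less uncertain) the hidden-layer data, the smaller its estimated entropy.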
S303. Superpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.

More specifically, the substitution entropy U[·] of the third continuous value attribute is calculated as follows:
Figure PCTCN2019072932-appb-000087
By constructing the above objective function, the
Figure PCTCN2019072932-appb-000088
-th hidden layer output matrix
Figure PCTCN2019072932-appb-000089
that minimizes the loss function is obtained. The training process of the neural network adopts the training mode of a traditional neural network, which is not repeated here.
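A hedged sketch of such a combined objective follows; the mean-squared error standing in for E[·] and the Gaussian-kernel entropy estimate standing in for U[·] are assumptions, since the patent's formula images are unavailable.

```python
import numpy as np

def enn_objective(actual, target, hidden, bandwidth=0.1):
    """Loss in the spirit of formula (6): reconstruction error of the
    network output plus the summed substitution entropy of each
    hidden-layer node's outputs (one sub-data set per node)."""
    error = np.mean((actual - target) ** 2)   # E[.]: assumed squared error
    entropy = 0.0
    for column in hidden.T:                   # superpose per-node entropies
        n = column.size
        diffs = (column[:, None] - column[None, :]) / bandwidth
        kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
        density = kernel.sum(axis=1) / (n * bandwidth)
        entropy += -np.mean(np.log(density))
    return error + entropy

hidden = np.array([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]])  # toy activations
base = enn_objective(np.zeros(3), np.zeros(3), hidden)
worse = enn_objective(np.ones(3), np.zeros(3), hidden)
print(worse - base)  # the error term adds exactly 1.0 here
```

Minimizing this sum trades output fidelity against the uncertainty of the selected hidden layer, which is the stated motivation for adding U[·] to the loss.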
S203. Use the neural network to train the second continuous value attribute, so as to take the data of the
Figure PCTCN2019072932-appb-000090
-th hidden layer as the third continuous value attribute.

S204. Merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.

S205. Classify the fourth continuous value attribute to obtain classified data.
In the data classification method provided in this embodiment, when designing the neural network used to make discrete value attributes continuous, the uncertainty of the data set is introduced into the loss function; that is, in addition to ensuring that the error between the actual output and the true output is minimized, the uncertainty of the converted data set is also minimized. Experimental results show that, compared with the traditional one-hot encoding method, this deep encoding enables support vector machines and neural networks to obtain higher classification accuracy on mixed-attribute data sets.
Fig. 3 is a schematic structural diagram of a data classification device according to an exemplary embodiment of the present invention. As shown in Fig. 3, this embodiment provides a data classification device, including: an obtaining module 101, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, where the data includes the discrete value attribute and the first continuous value attribute; a module 102, configured to train the second continuous value attribute with a neural network and take the data of the
Figure PCTCN2019072932-appb-000091
-th hidden layer as a third continuous value attribute, where the neural network includes
Figure PCTCN2019072932-appb-000092
hidden layers; the obtaining module 101 is further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and the obtaining module 101 is further configured to classify the fourth continuous value attribute to obtain classified data.
Optionally, the device further includes: a construction module 103, configured to construct an objective function, where the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute; and a training module 104, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
Optionally, the construction module 103 is specifically configured to: subtract the theoretical value of the third continuous value attribute from the third continuous value attribute to obtain the error value; divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; obtain the substitution entropy of each first sub-data set; and, via a superposition module, superpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
Optionally, the construction module 103 is specifically configured to: obtain the substitution entropy of the first sub-data set according to a first formula, where the first formula is:
Figure PCTCN2019072932-appb-000093
denotes the first sub-data, En[·] denotes the substitution entropy,
Figure PCTCN2019072932-appb-000094
denotes the number of samples of the data,
Figure PCTCN2019072932-appb-000095
b q denotes the window width of the kernel density estimation method, and
Figure PCTCN2019072932-appb-000096
denote the n-th and m-th elements of the first sub-data, respectively.
Optionally, the construction module 103 is specifically configured to: obtain the substitution entropy of the third continuous value attribute according to a second formula, where the second formula is:
Figure PCTCN2019072932-appb-000097
where
Figure PCTCN2019072932-appb-000098
is the number of nodes contained in the
Figure PCTCN2019072932-appb-000099
-th hidden layer,
Figure PCTCN2019072932-appb-000100
is the third continuous value attribute, and
Figure PCTCN2019072932-appb-000101
In short, the data classification device provided by this application can be used to perform the above data classification method; for its content and effects, reference may be made to the method part, which is not repeated here.

Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention. As shown in Fig. 4, the electronic device 200 of this embodiment includes a processor 201 and a memory 202, where:
the memory 202 is configured to store computer-executable instructions;

the processor 201 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the receiving device in the above embodiments; for details, reference may be made to the related description in the foregoing method embodiments.

Optionally, the memory 202 may be independent or integrated with the processor 201.

When the memory 202 is provided independently, the flow control device 200 further includes a bus 203 for connecting the memory 202 and the processor 201.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data classification method described above.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A data classification method, characterized in that it comprises:
    performing continuous encoding processing on a discrete value attribute to obtain a second continuous value attribute, wherein data includes the discrete value attribute and a first continuous value attribute;
    training the second continuous value attribute with a neural network, and taking the data of the
    Figure PCTCN2019072932-appb-100001
    -th hidden layer as a third continuous value attribute, wherein the neural network comprises
    Figure PCTCN2019072932-appb-100002
    hidden layers;
    merging the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute;
    classifying the fourth continuous value attribute to obtain classified data.
  2. The method according to claim 1, characterized in that, before the training of the second continuous value attribute with the neural network to take the data of the
    Figure PCTCN2019072932-appb-100003
    -th hidden layer as the third continuous value attribute, the method further comprises:
    constructing an objective function, wherein the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute;
    training the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  3. The method according to claim 2, wherein the constructing the objective function specifically comprises:
    subtracting the theoretical value of the third continuous value attribute from the third continuous value attribute to obtain the error value;
    dividing the third continuous value attribute into data sets to obtain first sub-data sets, wherein the first data set comprises a plurality of first sub-data sets;
    obtaining the substitution entropy of the first sub-data set;
    superposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  4. The method according to claim 3, wherein obtaining the substitution entropy of the first sub-data set specifically comprises:
    obtaining the substitution entropy of the first sub-data set according to a first formula, where the first formula is published as Figure PCTCN2019072932-appb-100004, in which the leading symbol denotes the first sub-data set, En[·] denotes the substitution entropy, Figure PCTCN2019072932-appb-100005 denotes the number of samples in the data, b_q (Figure PCTCN2019072932-appb-100006) denotes the window width of the kernel density estimation method, and the symbols of Figure PCTCN2019072932-appb-100007 denote the n-th and m-th elements of the first sub-data set, respectively.
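The first formula itself is published only as an image, but the surrounding text identifies its ingredients: a kernel density estimate with window width b_q taken over all pairs of elements n, m of the sub-data set. A plausible estimator of that shape (the Gaussian kernel and the outer log-mean are assumptions, not the patent's exact formula) is:

```python
import numpy as np

def substitution_entropy(x, b):
    """KDE-based entropy estimate over all element pairs (n, m).

    The Gaussian kernel and the -mean(log density) form are assumed; the
    patent's first formula is available only as an image."""
    x = np.asarray(x, dtype=float)
    diffs = x[:, None] - x[None, :]                            # pairwise x_n - x_m
    kernel = np.exp(-0.5 * (diffs / b) ** 2) / (b * np.sqrt(2 * np.pi))
    density = kernel.mean(axis=1)                              # KDE at each sample
    return float(-np.log(density + 1e-12).mean())
```

As expected of an entropy estimate, a more widely spread sample yields a larger value than a tightly concentrated one.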
  5. The method according to claim 3, wherein superposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute specifically comprises:
    obtaining the substitution entropy of the third continuous value attribute according to a second formula, where the second formula is published as Figure PCTCN2019072932-appb-100008, in which Figure PCTCN2019072932-appb-100009 is the number of nodes in the Figure PCTCN2019072932-appb-100010-th hidden layer, Figure PCTCN2019072932-appb-100011 is the third continuous value attribute, and Figure PCTCN2019072932-appb-100012.
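Since the second formula is published only as an image, the superposition can only be sketched from its described ingredients. Reading it as a sum of per-node entropies over the nodes of the hidden layer is an assumption:

```python
import numpy as np

def layer_entropy(hidden, entropy_fn):
    # Superpose (sum) a per-node entropy over every node of the hidden layer;
    # treating the second formula as a sum over hidden-layer node outputs is
    # an assumed reading, since the formula itself is published only as an image.
    total = 0.0
    for q in range(hidden.shape[1]):   # one column of `hidden` per hidden node
        total += entropy_fn(hidden[:, q])
    return total
```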
  6. A data classification apparatus, comprising:
    an obtaining module, configured to perform continuous encoding on a discrete value attribute to obtain a second continuous value attribute, wherein the data comprise the discrete value attribute and a first continuous value attribute;
    a designating module, configured to train the second continuous value attribute with a neural network and to take the data of the Figure PCTCN2019072932-appb-100013-th hidden layer as a third continuous value attribute, wherein the neural network comprises Figure PCTCN2019072932-appb-100014 hidden layers;
    the obtaining module being further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and
    the obtaining module being further configured to classify the fourth continuous value attribute to obtain classified data.
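The four operations of claim 6 form a pipeline that can be sketched end to end. Everything below the data setup is a placeholder (assumption): a fixed random projection stands in for the trained neural network's hidden layer, and a nearest-centroid rule stands in for the classification step, neither of which the claims prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixed-attribute toy data: one discrete column, two continuous columns, binary labels.
discrete = rng.integers(0, 3, size=100)
continuous = rng.normal(size=(100, 2))
labels = (discrete == 2).astype(int)

# Obtaining module: continuous encoding of the discrete value attribute
# (a one-hot layout here, as the raw input the network re-embeds).
second_attr = np.eye(3)[discrete]

# Designating module: a fixed random hidden layer stands in for the trained
# network; its output plays the role of the third continuous value attribute.
w = rng.normal(size=(3, 4))
third_attr = np.tanh(second_attr @ w)

# Merge: concatenate with the first continuous value attribute -> fourth attribute.
fourth_attr = np.hstack([continuous, third_attr])

# Classify: nearest class centroid over the merged continuous representation.
centroids = {c: fourth_attr[labels == c].mean(axis=0) for c in (0, 1)}
pred = np.array([min(centroids, key=lambda c: np.linalg.norm(row - centroids[c]))
                 for row in fourth_attr])
accuracy = float((pred == labels).mean())
```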
  7. The apparatus according to claim 6, further comprising:
    a building module, configured to construct an objective function, wherein the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute; and
    a training module, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  8. The apparatus according to claim 7, wherein the building module specifically comprises:
    a subtracting module, configured to subtract the theoretical value of the third continuous value attribute from the third continuous value attribute to obtain the error value;
    a dividing module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, wherein a first data set comprises a plurality of the first sub-data sets;
    an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and
    a superposing module, configured to superpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  9. An electronic device, comprising at least one processor and a memory, wherein
    the memory stores computer-executable instructions; and
    the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the data classification method according to any one of claims 1 to 5.
  10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data classification method according to any one of claims 1 to 5.
PCT/CN2019/072932 2019-01-24 2019-01-24 Data classification method and apparatus, and device and storage medium WO2020150955A1 (en)


Publications (1)

Publication Number Publication Date
WO2020150955A1 true WO2020150955A1 (en) 2020-07-30

Family

ID=71736027


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0877132A (en) * 1994-08-31 1996-03-22 Victor Co Of Japan Ltd Learning method for cross coupling type neural network
CN105786860A (en) * 2014-12-23 2016-07-20 华为技术有限公司 Data processing method and device in data modeling
CN108362510A (en) * 2017-11-30 2018-08-03 中国航空综合技术研究所 A kind of engineering goods method of fault pattern recognition based on evidence neural network model
CN108628868A (en) * 2017-03-16 2018-10-09 北京京东尚科信息技术有限公司 File classification method and device


Non-Patent Citations (1)

Title
SUN, JINGUANG ET AL.: "DBN Classification Algorithm for Numerical Attribute", COMPUTER ENGINEERING AND APPLICATIONS, vol. 50, no. 2, 15 January 2014 (2014-01-15), pages 112 - 114, ISSN: 1002-8331 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19911376

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 15.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19911376

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.04.2022)
