WO2020150955A1 - Data classification method and apparatus, and device and storage medium - Google Patents

Data classification method and apparatus, and device and storage medium Download PDF

Info

Publication number
WO2020150955A1
Authority
WO
WIPO (PCT)
Prior art keywords
value attribute
continuous value
data
continuous
attribute
Prior art date
Application number
PCT/CN2019/072932
Other languages
French (fr)
Chinese (zh)
Inventor
何玉林
Original Assignee
深圳大学
Application filed by 深圳大学 filed Critical 深圳大学
Priority to PCT/CN2019/072932 priority Critical patent/WO2020150955A1/en
Publication of WO2020150955A1 publication Critical patent/WO2020150955A1/en

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 — Pattern recognition

Definitions

  • the present invention relates to the field of data processing technology, and in particular to a data classification method, device, equipment and storage medium.
  • the operating data are mostly mixed-value attributes, which include continuous value attributes and discrete value attributes.
  • a common classification method is to convert discrete-valued attributes into continuous ones, and then classify the resulting continuous-valued attributes.
  • the attribute values after the one-hot encoding operation are still discrete in the sense of their numerical distribution, so the operation does not fundamentally make the discrete value attribute continuous.
  • the present invention provides a data classification method, device, equipment and storage medium to solve the problem that existing classification methods, which rely on one-hot encoding, do not truly make discrete value attributes continuous.
  • the present invention provides a data classification method, which includes: performing continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; training the second continuous value attribute with a neural network, and using the data of the Ƒ-th hidden layer as a third continuous value attribute, wherein the neural network includes Ƒ hidden layers; merging the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and classifying the fourth continuous value attribute to obtain classified data.
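The four steps above (continuous encoding, hidden-layer extraction, merging, classification) can be sketched end to end. The following is a minimal, hypothetical illustration in plain Python: the helper names and the tiny fixed "network" weights are invented stand-ins for the patented encoding neural network (ENN), not its actual implementation.

```python
import math

# Hypothetical sketch of the four-step flow; the fixed weights below stand in
# for a trained encoding neural network (ENN) and are purely illustrative.

def one_hot(value, values):
    """Step 1: continuous encoding of a discrete value attribute."""
    return [1.0 if v == value else 0.0 for v in values]

def hidden_layer(vec, weights):
    """Step 2: sigmoid hidden-layer outputs serve as the third attribute."""
    return [1.0 / (1.0 + math.exp(-sum(w * x for w, x in zip(row, vec))))
            for row in weights]

def classify(first_continuous, discrete_value, values, weights, rule):
    second = one_hot(discrete_value, values)   # second continuous attribute
    third = hidden_layer(second, weights)      # third continuous attribute
    fourth = first_continuous + third          # step 3: merge
    return rule(fourth)                        # step 4: classify

weights = [[0.5, -0.2, 0.1, 0.3],
           [0.1, 0.4, -0.3, 0.2]]
label = classify([0.7], "B2", ["B1", "B2", "B3", "B4"], weights,
                 rule=lambda x: int(sum(x) > 1.5))
```

Here the final `rule` is a toy threshold; the patent itself leaves the downstream classifier open (support vector machines, neural networks, and the like).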
  • the discrete-valued attributes are first continuously encoded, and the neural network is then used to train the second continuous-valued attribute, thereby thoroughly transforming the discrete-valued attributes into real-valued continuous value attributes that carry order information.
  • before training the second continuous value attribute with the neural network and using the data of the Ƒ-th hidden layer as the third continuous value attribute, the method further includes: constructing an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and training the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  • the sum of the error value of the third continuous value attribute and the substitution entropy is used as the objective function to train the neural network; training on the second continuous value attribute therefore not only ensures the minimum error between the actual output and the theoretical output, but also ensures the minimum uncertainty of the converted data set.
  • constructing the objective function specifically includes: subtracting the third continuous value attribute from its theoretical value to obtain an error value; dividing the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; obtaining the substitution entropy of each first sub-data set; and superimposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • the third continuous value attribute is divided into first sub-data sets, and the substitution entropy of the first data set is obtained from the substitution entropies of the first sub-data sets, which reduces computational complexity.
  • obtaining the substitution entropy of the first sub-data set specifically includes:
  • the substitution entropy of the first sub-data set is obtained according to the first formula, where En[·] represents the substitution entropy, b q represents the window width of the kernel density estimation method, and the remaining symbols represent the first sub-data, the number of samples of the data, and the n-th and m-th elements of the first sub-data, respectively.
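The first formula itself is rendered as an image in the original filing and is not reproduced in this text. Assuming the standard resubstitution (plug-in) entropy estimator with a Gaussian kernel of window width b_q — an interpretation, not the verified patent formula — it would take the form:

```latex
% Hypothetical reconstruction (assumes a Gaussian kernel): substitution
% entropy of a sub-data set Y_q = \{y^q_1, \dots, y^q_N\} with width b_q.
\mathrm{En}[Y_q] = -\frac{1}{N}\sum_{n=1}^{N}
  \log\!\left(\frac{1}{N b_q \sqrt{2\pi}}\sum_{m=1}^{N}
  \exp\!\left(-\frac{\left(y^{q}_{n}-y^{q}_{m}\right)^{2}}{2 b_q^{2}}\right)\right)
```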
  • performing superposition processing on the substitution entropy of a plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute specifically includes:
  • the substitution entropy of the third continuous value attribute is obtained according to the second formula, in which one symbol denotes the number of nodes in the Ƒ-th hidden layer and another denotes the third continuous value attribute.
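The second formula is likewise an image in the original. Since the sub-data-set entropies are "superimposed", the natural reading — offered as an assumption — is a plain sum over the nodes of the Ƒ-th hidden layer:

```latex
% Hypothetical reconstruction: K is the number of nodes in the \mathcal{F}-th
% hidden layer; Y is the third continuous value attribute, Y_q its q-th
% sub-data set.
U[Y] = \sum_{q=1}^{K} \mathrm{En}[Y_q]
```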
  • the data classification device is introduced below; its implementation principle and technical effects are similar to those of the above method and are not repeated here.
  • the present invention provides a data classification device, including: an obtaining module, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; a training module, configured to train the second continuous value attribute with a neural network and use the data of the Ƒ-th hidden layer as a third continuous value attribute, wherein the neural network includes Ƒ hidden layers; the obtaining module being further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute, and to classify the fourth continuous value attribute to obtain classified data.
  • the device further includes: a construction module, configured to construct an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and a training module, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  • the construction module specifically includes: a subtraction module, configured to subtract the third continuous value attribute from its theoretical value to obtain an error value; a division module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and a superposition module, configured to superimpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • the construction module specifically includes:
  • the substitution entropy of the first sub-data set is obtained according to the first formula, where En[·] represents the substitution entropy, b q represents the window width of the kernel density estimation method, and the remaining symbols represent the first sub-data, the number of samples of the data, and the n-th and m-th elements of the first sub-data, respectively.
  • the construction module specifically includes:
  • the substitution entropy of the third continuous value attribute is obtained according to the second formula, in which one symbol denotes the number of nodes in the Ƒ-th hidden layer and another denotes the third continuous value attribute.
  • the present invention provides an electronic device, including: at least one processor and a memory, where the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the data classification method of the first aspect and its optional solutions.
  • the present invention provides a computer-readable storage medium storing computer-executable instructions; when a processor executes the computer-executable instructions, the data classification method of the first aspect and its optional solutions is implemented.
  • the present invention provides a data classification method, device, equipment and storage medium.
  • a discrete value attribute is continuously encoded to obtain a second continuous value attribute; a neural network is used to train the second continuous value attribute, and the data of the Ƒ-th hidden layer is used as a third continuous value attribute, thereby completely transforming discrete value attributes into real-valued continuous value attributes with order information.
  • the classification process is performed to obtain the classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using only one-hot encoding.
  • Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention
  • Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention
  • Fig. 3 is a schematic diagram showing the structure of a data classification device according to an exemplary embodiment of the present invention.
  • Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
  • the present invention provides a data classification method, device, equipment and storage medium to solve the problem that existing classification methods, which rely on one-hot encoding, do not truly make discrete value attributes continuous.
  • Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Figure 1, the data classification method provided in this embodiment includes:
  • S101 The data includes a discrete value attribute and a first continuous value attribute. The discrete value attribute is continuously encoded to obtain the second continuous value attribute, which realizes a preliminary conversion of the discrete value attribute into a continuous value attribute.
  • one-hot encoding can be used to convert the discrete value attribute into the second continuous value attribute.
  • the data set is divided into continuous value attributes and discrete value attributes. Two symbols respectively represent the numbers of continuous value attributes and discrete value attributes that the data set contains, and another symbol represents the number of samples in the data set. For a discrete value attribute of the data set, a further symbol represents the number of values it takes; the category of the n-th sample is likewise denoted, on the assumption that the data set contains a given number of categories.
  • the neural network includes Ƒ hidden layers.
  • S102 The second continuous value attribute is input into the neural network for training, and the data of the Ƒ-th hidden layer is output as the third continuous value attribute.
  • an encoding neural network (ENN) is constructed, which takes the one-hot encoded data set shown in Table 3 as input.
  • the input of ENN is expressed by formula (2).
  • the number of input layer nodes of ENN is:
  • the number of output layer nodes of the ENN is:
  • each hidden layer node uses the sigmoid function to activate its input; the f-th hidden layer contains a given number of nodes and is expressed by formula (5).
  • Table 4 The third continuous value attribute data set
  • S103 Perform merging processing on the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.
  • the first continuous value attribute and the third continuous value attribute are combined to obtain a fourth continuous value attribute, where the fourth continuous value attribute includes the first continuous value attribute and the third continuous value attribute.
  • the third continuous value attribute is expressed as:
  • S104 Perform classification processing on the fourth continuous value attribute to obtain classified data.
  • any classification method for continuous-valued attribute data, such as support vector machines or neural networks, can be used to process the real-valued attribute data set.
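The patent leaves this downstream classifier open. As one hedged illustration, here is a minimal nearest-centroid rule over fourth-continuous-value-attribute vectors; the class names and data below are invented for the example:

```python
# Illustrative only: a nearest-centroid classifier standing in for the
# unspecified downstream classifier (e.g., an SVM or neural network).

def centroid(rows):
    """Per-dimension mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_centroid(sample, centroids):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(sample, centroids[lbl]))

# Fourth continuous value attributes: original continuous values merged with
# hidden-layer outputs (all real-valued), grouped by a known class label.
train = {
    "normal": [[0.9, 0.8, 0.7], [1.0, 0.7, 0.8]],
    "faulty": [[0.1, 0.2, 0.3], [0.2, 0.1, 0.2]],
}
centroids = {lbl: centroid(rows) for lbl, rows in train.items()}
label = nearest_centroid([0.85, 0.75, 0.8], centroids)
# label == "normal"
```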
  • the discrete value attribute is continuously encoded to obtain the second continuous value attribute; the neural network is used to train the second continuous value attribute, and the data of the Ƒ-th hidden layer is used as the third continuous value attribute, thereby completely transforming discrete value attributes into real-valued continuous value attributes with order information.
  • the classification process is performed to obtain the classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using only one-hot encoding.
  • Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Figure 2, the data classification method provided in this embodiment includes:
  • S201 Perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute.
  • S202 Construct an objective function, and use the second continuous value attribute to train the neural network until the value of the objective function is the minimum value.
  • the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy.
  • E[·] is the training error of the ENN corresponding to the third continuous value attribute data set.
  • U[·] is the uncertainty of the data of the Ƒ-th hidden layer.
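Reading E[·] and U[·] together, the objective minimized in S202 can be written compactly; the notation J(θ) for the combined objective over network parameters θ is ours, not the patent's:

```latex
% Combined objective: ENN training error plus the uncertainty
% (substitution entropy) of the \mathcal{F}-th hidden layer's data.
\min_{\theta}\; J(\theta) = E[\cdot] + U[\cdot]
```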
  • the error value can be obtained by subtracting the third continuous value attribute from its theoretical value.
  • S301 Perform data set division on the third continuous value attribute to obtain a first sub-data set.
  • the first data set includes a plurality of first sub-data sets.
  • the third continuous value attribute data set is expressed as:
  • the first sub-data set is expressed as:
  • substitution entropy calculation method of the first sub-data set is as follows:
  • one symbol is the entropy corresponding to the data set, and another represents the probability density function of the data set obtained by the kernel density estimation method.
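As a numeric illustration of entropy via kernel density estimation, the sketch below uses the generic resubstitution estimator with a Gaussian kernel — a common construction, not necessarily the patent's exact formula:

```python
import math

def substitution_entropy(samples, b):
    """Resubstitution entropy estimate: average negative log of the
    Gaussian kernel-density estimate evaluated at each sample point."""
    n = len(samples)
    total = 0.0
    for x in samples:
        kernel_sum = sum(math.exp(-((x - y) ** 2) / (2 * b * b))
                         for y in samples)
        density = kernel_sum / (n * b * math.sqrt(2 * math.pi))
        total += -math.log(density)
    return total / n

tight = substitution_entropy([0.50, 0.51, 0.49, 0.50], b=0.1)
spread = substitution_entropy([0.1, 0.9, 0.3, 0.7], b=0.1)
# A tightly clustered sample has lower estimated entropy (uncertainty)
# than a spread-out one, which is what the objective's U term penalizes.
```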
  • b q represents the window width parameter of the kernel density estimation method.
  • b q is a function of the number of samples.
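The text only states that b q depends on the number of samples. A common window-width choice in kernel density estimation — given here as general background, not as the patent's definition — is Silverman's rule of thumb:

```latex
% Silverman's rule of thumb for a Gaussian kernel: the width shrinks with
% the sample count N; \hat{\sigma} is the sample standard deviation.
b \approx 1.06\,\hat{\sigma}\,N^{-1/5}
```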
  • S302 Perform superposition processing on the substitution entropy of the multiple first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • substitution entropy U[ ⁇ ] of the third continuous value attribute is calculated as follows:
  • S204 Perform merging processing on the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.
  • S205 Perform classification processing on the fourth continuous value attribute to obtain classified data.
  • Fig. 3 is a schematic diagram showing the structure of a data classification device according to an exemplary embodiment of the present invention.
  • this embodiment provides a data classification device, including: an obtaining module 101, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, where the data includes the discrete value attribute and a first continuous value attribute; a module 102, configured to train the second continuous value attribute with a neural network and use the data of the Ƒ-th hidden layer as a third continuous value attribute, where the neural network includes Ƒ hidden layers; the obtaining module 101 being further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute, and to classify the fourth continuous value attribute to obtain classified data.
  • the device further includes: a construction module 103, configured to construct an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and a training module 104, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  • the construction module 103 specifically includes: a subtraction module, configured to subtract the third continuous value attribute from its theoretical value to obtain an error value; a division module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and a superposition module, configured to superimpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  • the construction module 103 specifically includes:
  • the substitution entropy of the first sub-data set is obtained according to the first formula, where En[·] represents the substitution entropy, b q represents the window width of the kernel density estimation method, and the remaining symbols represent the first sub-data, the number of samples of the data, and the n-th and m-th elements of the first sub-data, respectively.
  • the construction module 103 specifically includes:
  • the substitution entropy of the third continuous value attribute is obtained according to the second formula, in which one symbol denotes the number of nodes in the Ƒ-th hidden layer and another denotes the third continuous value attribute.
  • the data classification device provided by this application can be used to implement the above data classification method, and its content and effects can be referred to the method part, which will not be repeated in this application.
  • Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
  • the electronic device 200 of this embodiment includes a processor 201 and a memory 202, where:
  • the memory 202 is used to store computer execution instructions
  • the processor 201 is configured to execute the computer-executable instructions stored in the memory to implement the steps of the data classification method in the foregoing embodiments. For details, refer to the related description in the foregoing method embodiments.
  • the memory 202 may be independent or integrated with the processor 201.
  • the electronic device 200 further includes a bus 203 for connecting the memory 202 and the processor 201.
  • the embodiment of the present invention also provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the processor executes the computer-executable instructions, the data classification method as described above is implemented.

Abstract

Disclosed are a data classification method and apparatus, and a device and a storage medium. The method comprises: continuously encoding a discrete value attribute to obtain a second continuous value attribute, wherein the data comprises the discrete value attribute and a first continuous value attribute; training the second continuous value attribute by using a neural network, and using the data of the Ƒ-th hidden layer as a third continuous value attribute, wherein the neural network comprises Ƒ hidden layers; combining the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and classifying the fourth continuous value attribute to obtain classified data. In the present invention, the discrete value attribute is first continuously encoded, and the second continuous value attribute is then trained by using a neural network, thereby completely transforming the discrete value attribute into a real-valued continuous value attribute that carries order information.

Description

Data classification method, device, equipment and storage medium

Technical field
The present invention relates to the field of data processing technology, and in particular to a data classification method, device, equipment and storage medium.
Background art
In industrial scenarios, in order to ensure that industrial equipment works normally, various data of the industrial equipment must be collected in real time to obtain operating data; the operating data is then classified, and the operating status of the equipment is evaluated based on the classified operating data. The operating data mostly has mixed-value attributes, which include continuous value attributes and discrete value attributes.
In the prior art, a common classification method is to convert discrete-valued attributes into continuous ones, and then classify the continuous-valued attributes. One-hot encoding is usually used to encode discrete-valued attributes as continuous attributes. For example, for a discrete-valued attribute B = {B1, B2, B3, B4} with four symbolic values, when a sample takes the value B1, B2, B3, or B4 on attribute B, the one-hot encoded attribute values are represented as (1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1), respectively.
However, the attribute values after the one-hot encoding operation are still discrete in the sense of their numerical distribution, so the operation does not fundamentally make the discrete value attribute continuous.
Summary of the invention
The present invention provides a data classification method, device, equipment and storage medium to solve the problem that existing classification methods, which rely on one-hot encoding, do not truly make discrete value attributes continuous.
In a first aspect, the present invention provides a data classification method, which includes: performing continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; training the second continuous value attribute with a neural network, and using the data of the
Figure PCTCN2019072932-appb-000001
-th hidden layer as a third continuous value attribute, wherein the neural network includes
Figure PCTCN2019072932-appb-000002
hidden layers; merging the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and classifying the fourth continuous value attribute to obtain classified data.
In the data classification method provided by the present invention, the discrete value attribute is first continuously encoded, and the neural network is then used to train the second continuous value attribute, thereby thoroughly transforming the discrete value attribute into a real-valued continuous value attribute carrying order information.
Optionally, before training the second continuous value attribute with the neural network and using the data of the
Figure PCTCN2019072932-appb-000003
-th hidden layer as the third continuous value attribute, the method further includes: constructing an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and training the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
In the data classification method provided by the present invention, the sum of the error value of the third continuous value attribute and the substitution entropy is used as the objective function to train the neural network; thus, besides ensuring the minimum error between the actual output and the theoretical output, the minimum uncertainty of the converted data set is also ensured.
Optionally, constructing the objective function specifically includes: subtracting the third continuous value attribute from its theoretical value to obtain an error value; dividing the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; obtaining the substitution entropy of each first sub-data set; and superimposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
In the data classification method provided by the present invention, the third continuous value attribute is divided into first sub-data sets, and the substitution entropy of the first data set is obtained from the substitution entropies of the first sub-data sets, reducing computational complexity.
Optionally, obtaining the substitution entropy of the first sub-data set specifically includes:
The substitution entropy of the first sub-data set is obtained according to the first formula, where the first formula is:
Figure PCTCN2019072932-appb-000004
represents the first sub-data, En[·] represents the substitution entropy,
Figure PCTCN2019072932-appb-000005
represents the number of samples of the data,
Figure PCTCN2019072932-appb-000006
b q represents the window width of the kernel density estimation method, and
Figure PCTCN2019072932-appb-000007
respectively represent the n-th and m-th elements of the first sub-data.
Optionally, superimposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute specifically includes:
The substitution entropy of the third continuous value attribute is obtained according to the second formula, where the second formula is:
Figure PCTCN2019072932-appb-000008
where
Figure PCTCN2019072932-appb-000009
is the number of nodes in the
Figure PCTCN2019072932-appb-000010
-th hidden layer,
Figure PCTCN2019072932-appb-000011
is the third continuous value attribute, and
Figure PCTCN2019072932-appb-000012
The data classification device is introduced below; its implementation principle and technical effects are similar to those of the above method and are not repeated here.
In a second aspect, the present invention provides a data classification device, including: an obtaining module, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, wherein the data includes the discrete value attribute and a first continuous value attribute; a module configured to train the second continuous value attribute with a neural network and use the data of the
Figure PCTCN2019072932-appb-000013
-th hidden layer as a third continuous value attribute, wherein the neural network includes
Figure PCTCN2019072932-appb-000014
hidden layers; the obtaining module being configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute, and further configured to classify the fourth continuous value attribute to obtain classified data.
Optionally, the device further includes: a construction module, configured to construct an objective function, where the objective function is the sum of the error value of the third continuous value attribute and the substitution entropy; and a training module, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
Optionally, the construction module specifically includes: a subtraction module, configured to subtract the third continuous value attribute from its theoretical value to obtain an error value; a division module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and a superposition module, configured to superimpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
Optionally, the construction module is specifically configured to: obtain the substitution entropy of the first sub-data set according to a first formula, where the first formula is:
Figure PCTCN2019072932-appb-000015
denotes the first sub-data, En[·] denotes the substitution entropy,
Figure PCTCN2019072932-appb-000016
denotes the number of samples of the data,
Figure PCTCN2019072932-appb-000017
b q denotes the window width of the kernel density estimation method, and
Figure PCTCN2019072932-appb-000018
denote the n-th and m-th elements of the first sub-data, respectively.
Optionally, the construction module is specifically configured to: obtain the substitution entropy of the third continuous value attribute according to a second formula, where the second formula is:
Figure PCTCN2019072932-appb-000019
where
Figure PCTCN2019072932-appb-000020
is the number of nodes contained in the
Figure PCTCN2019072932-appb-000021
-th hidden layer,
Figure PCTCN2019072932-appb-000022
is the third continuous value attribute, and
Figure PCTCN2019072932-appb-000023
The electronic device and the readable storage medium are introduced below; their implementation principles and technical effects are similar to those of the foregoing method and are not repeated here.

In a third aspect, the present invention provides an electronic device, including: at least one processor and a memory, where the memory stores computer-executable instructions, and the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the data classification method according to the first aspect and its optional solutions.

In a fourth aspect, the present invention provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data classification method according to the first aspect and its optional solutions.
The present invention provides a data classification method, device, equipment and storage medium. In the data classification method, a discrete value attribute is continuously encoded to obtain a second continuous value attribute; a neural network is used to train the second continuous value attribute, and the data of the
Figure PCTCN2019072932-appb-000024
-th hidden layer is taken as a third continuous value attribute, thereby completely transforming the discrete value attribute into a continuous value attribute that carries order information and takes real values. The first continuous value attribute and the third continuous value attribute are merged and then classified to obtain classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using one-hot encoding alone.
Description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention;

Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a data classification device according to an exemplary embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
Detailed description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.

The present invention provides a data classification method, device, equipment and storage medium to solve the problem that the existing classification method, which uses a one-hot encoding operation, does not achieve true continuity of discrete value attributes.
Fig. 1 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Fig. 1, the data classification method provided in this embodiment includes:

S101. Perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute.

More specifically, the data includes a discrete value attribute and a first continuous value attribute. The discrete value attribute is continuously encoded to obtain the second continuous value attribute, realizing a preliminary conversion of the discrete value attribute into a continuous value attribute.

In this embodiment, one-hot encoding can be used to convert the discrete value attribute into the second continuous value attribute.
For example, suppose there is a mixed-value attribute data set as shown in Table 1 below:
Figure PCTCN2019072932-appb-000025
where the data set is divided into continuous value attributes and discrete value attributes;
Figure PCTCN2019072932-appb-000026
and
Figure PCTCN2019072932-appb-000027
respectively denote the numbers of continuous value and discrete value attributes contained in the data set
Figure PCTCN2019072932-appb-000028
;
Figure PCTCN2019072932-appb-000029
denotes the number of samples contained in the data set
Figure PCTCN2019072932-appb-000030
;
Figure PCTCN2019072932-appb-000031
denotes the
Figure PCTCN2019072932-appb-000032
-th continuous value attribute;
Figure PCTCN2019072932-appb-000033
denotes the
Figure PCTCN2019072932-appb-000034
-th discrete value attribute, whose values are assumed to be
Figure PCTCN2019072932-appb-000035
denotes the number of values taken by the discrete value attribute
Figure PCTCN2019072932-appb-000036
; then
Figure PCTCN2019072932-appb-000037
denotes the category of the n-th sample; assuming the data set
Figure PCTCN2019072932-appb-000038
has
Figure PCTCN2019072932-appb-000039
categories
Figure PCTCN2019072932-appb-000040
, then
Figure PCTCN2019072932-appb-000041
Table 1: Mixed-value attribute data set
Figure PCTCN2019072932-appb-000042
The data set composed of discrete value attributes shown in Table 2 below,
Figure PCTCN2019072932-appb-000043
, is one-hot encoded to obtain the one-hot encoded data set
Figure PCTCN2019072932-appb-000044
shown in Table 3 below.
Table 2: Discrete value attribute data set
Figure PCTCN2019072932-appb-000045
Figure PCTCN2019072932-appb-000046

Table 3: One-hot encoded data set
Figure PCTCN2019072932-appb-000047
Figure PCTCN2019072932-appb-000048
In Table 3, the following formula (1) is satisfied:
Figure PCTCN2019072932-appb-000049
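To make the one-hot step concrete, the following is a minimal sketch, not the patent's implementation; the attribute values and dimensions are invented for illustration. Each distinct discrete value maps to a 0/1 indicator vector whose components sum to 1, which is the property formula (1) expresses.

```python
import numpy as np

def one_hot_encode(column):
    """One-hot encode a single discrete value attribute: each distinct
    value maps to a 0/1 indicator vector whose components sum to 1."""
    values = sorted(set(column))
    index = {v: i for i, v in enumerate(values)}
    encoded = np.zeros((len(column), len(values)))
    for row, value in enumerate(column):
        encoded[row, index[value]] = 1.0
    return encoded

# Hypothetical discrete attribute with three categories.
colors = ["red", "green", "blue", "green"]
onehot = one_hot_encode(colors)
print(onehot.sum(axis=1))  # every row sums to 1
```

As noted in the description, the resulting columns are still discrete in the sense of numerical distribution, which is why the method feeds them into an encoding network next.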
S102. Use a neural network to train the second continuous value attribute, so as to take the data of the
Figure PCTCN2019072932-appb-000050
-th hidden layer as a third continuous value attribute.

More specifically, the neural network includes
Figure PCTCN2019072932-appb-000051
hidden layers. The second continuous value attribute is input into the neural network for training, and the data of the
Figure PCTCN2019072932-appb-000052
-th hidden layer of the neural network is output as the third continuous value attribute.
In this embodiment, a neural network containing
Figure PCTCN2019072932-appb-000053
hidden layers is constructed, which is called an Encoding Neural Network (ENN), where
Figure PCTCN2019072932-appb-000054
, and the one-hot encoded data set shown in Table 3 is used as its input.
The input of the ENN is expressed by formula (2):
Figure PCTCN2019072932-appb-000055
The number of input layer nodes of the ENN is:
Figure PCTCN2019072932-appb-000056
The output of the ENN is expressed by formula (4):
Figure PCTCN2019072932-appb-000057
The number of output layer nodes of the ENN is
Figure PCTCN2019072932-appb-000058
The hidden layer nodes use the Sigmoid function to activate their inputs. The f-th hidden layer contains
Figure PCTCN2019072932-appb-000059
nodes, where
Figure PCTCN2019072932-appb-000060
. The f-th hidden layer is expressed by formula (5):
Figure PCTCN2019072932-appb-000061
After the ENN is constructed, the neural network is used to train the second continuous value attribute, so that the data of the
Figure PCTCN2019072932-appb-000062
-th hidden layer is taken as the third continuous value attribute. The third continuous value attribute data set is shown in Table 4.

Table 4: Third continuous value attribute data set
Figure PCTCN2019072932-appb-000063
Figure PCTCN2019072932-appb-000064
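A minimal sketch of the encoding idea above follows. The layer sizes, the random (and here untrained) weights, and the choice of the last hidden layer are all assumptions made for illustration; in the method described, the ENN is trained first and the activations of the selected hidden layer become the third continuous value attribute.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_activations(x, weights, biases, layer):
    """Propagate one-hot inputs through sigmoid hidden layers and
    return the activations of the chosen hidden layer."""
    h = x
    activations = []
    for w, b in zip(weights, biases):
        h = sigmoid(h @ w + b)
        activations.append(h)
    return activations[layer]

rng = np.random.default_rng(0)
sizes = [5, 4, 3]                       # 5 one-hot inputs, two hidden layers
weights = [rng.normal(size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = np.eye(5)[:4]                       # four one-hot samples
encoded = hidden_activations(x, weights, biases, layer=-1)
print(encoded.shape)                    # (4, 3): real-valued encoding
```

Because every activation is a real number in (0, 1), the encoded attribute carries order information, in contrast to the 0/1 one-hot columns.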
S103. Merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.

More specifically, the first continuous value attribute and the third continuous value attribute are merged to obtain the fourth continuous value attribute, where the fourth continuous value attribute includes both the first continuous value attribute and the third continuous value attribute.
In this embodiment, the third continuous value attribute is expressed as:
Figure PCTCN2019072932-appb-000065
The data corresponding to the first continuous value attribute in Table 1 and the data corresponding to the third continuous value attribute in Table 4 are merged to obtain a fourth continuous value data set
Figure PCTCN2019072932-appb-000066
whose attribute values are all real values, as shown in Table 5.
Table 5: Real-valued attribute data set
Figure PCTCN2019072932-appb-000067
Figure PCTCN2019072932-appb-000068
S104. Classify the fourth continuous value attribute to obtain classified data.

More specifically, any classification method for continuous value attribute data, such as a support vector machine or a neural network, can be used to process the real-valued attribute data set
Figure PCTCN2019072932-appb-000069
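Steps S103 and S104 can be sketched as follows. This is a hedged illustration only: the tiny arrays are invented, and a nearest-centroid rule stands in for the classifier, since the method allows any continuous-attribute classifier such as a support vector machine or a neural network.

```python
import numpy as np

def merge_attributes(continuous, encoded):
    """Concatenate the first continuous value attributes with the
    hidden-layer encoding to form the fourth continuous value attribute."""
    return np.hstack([continuous, encoded])

def nearest_centroid_fit(x, y):
    labels = sorted(set(y))
    return {c: x[np.array(y) == c].mean(axis=0) for c in labels}

def nearest_centroid_predict(x, centroids):
    labels = list(centroids)
    d = np.stack([np.linalg.norm(x - centroids[c], axis=1) for c in labels])
    return [labels[i] for i in d.argmin(axis=0)]

cont = np.array([[0.2], [0.8], [0.1], [0.9]])                    # first attribute
enc = np.array([[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.9, 0.1]])  # third attribute
merged = merge_attributes(cont, enc)                              # fourth attribute
centroids = nearest_centroid_fit(merged, [0, 1, 0, 1])
print(nearest_centroid_predict(merged, centroids))                # [0, 1, 0, 1]
```

The point of the merge is that every column of `merged` is real-valued, so a single continuous-attribute classifier can consume the whole data set.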
In the data classification method provided in this embodiment, the discrete value attribute is continuously encoded to obtain the second continuous value attribute; the neural network is used to train the second continuous value attribute, and the data of the
Figure PCTCN2019072932-appb-000070
-th hidden layer is taken as the third continuous value attribute, thereby completely transforming the discrete value attribute into a continuous value attribute that carries order information and takes real values. The first continuous value attribute and the third continuous value attribute are merged and then classified to obtain the classified data, so that the classification accuracy is higher than that of the prior art, which classifies mixed-value attribute data using one-hot encoding alone.
Fig. 2 is a flowchart of a data classification method according to an exemplary embodiment of the present invention. As shown in Fig. 2, the data classification method provided in this embodiment includes:

S201. Perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute.

S202. Construct an objective function, and use the second continuous value attribute to train the neural network until the value of the objective function reaches its minimum.

More specifically, the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute.
The loss function L of the ENN is expressed by formula (6):
Figure PCTCN2019072932-appb-000071
where E[·] is the error between the actual output and the theoretical output of the ENN corresponding to the third continuous value attribute data set
Figure PCTCN2019072932-appb-000072
, and U[·] is the uncertainty of the data of the
Figure PCTCN2019072932-appb-000073
-th hidden layer
Figure PCTCN2019072932-appb-000074
The error value can be obtained by subtracting the theoretical value of the third continuous value attribute from the third continuous value attribute.

The substitution entropy of the third continuous value attribute is calculated in the following steps:

S301. Divide the third continuous value attribute into data sets to obtain first sub-data sets.

More specifically, the first data set includes a plurality of first sub-data sets.
Here, the third continuous value attribute data set is expressed as:
Figure PCTCN2019072932-appb-000075
The first sub-data set is expressed as:
Figure PCTCN2019072932-appb-000076
S302. Obtain the substitution entropy of the first sub-data set.

More specifically, the substitution entropy of the first sub-data set is calculated as follows:
Figure PCTCN2019072932-appb-000077
where
Figure PCTCN2019072932-appb-000078
is the substitution entropy corresponding to the data set
Figure PCTCN2019072932-appb-000079
, and
Figure PCTCN2019072932-appb-000080
denotes the probability density function, obtained by the kernel density estimation method on the data set
Figure PCTCN2019072932-appb-000081
, of the data set
Figure PCTCN2019072932-appb-000082
Figure PCTCN2019072932-appb-000083
is calculated as follows:
Figure PCTCN2019072932-appb-000084
where b q denotes the window width parameter of the kernel density estimation method, b q > 0, and b q is a function of the number of samples
Figure PCTCN2019072932-appb-000085
that satisfies the following conditions:
Figure PCTCN2019072932-appb-000086
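Because the formula images are not reproduced here, the following sketch shows only the general form of such a kernel-density (resubstitution) entropy estimate; the Gaussian kernel and the specific bandwidths are assumptions for illustration, not the patent's exact choices.

```python
import numpy as np

def substitution_entropy(z, bandwidth):
    """Resubstitution entropy estimate of a 1-D sample: estimate the
    density at every sample point with a Gaussian kernel of window
    width `bandwidth`, then average the negative log-densities."""
    z = np.asarray(z, dtype=float)
    n = z.size
    diffs = (z[:, None] - z[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    density = kernel.sum(axis=1) / (n * bandwidth)
    return -np.mean(np.log(density))

rng = np.random.default_rng(0)
narrow = substitution_entropy(rng.normal(0.0, 0.1, 200), bandwidth=0.05)
wide = substitution_entropy(rng.normal(0.0, 1.0, 200), bandwidth=0.5)
print(narrow < wide)  # a more concentrated sample has lower estimated entropy
```

This matches the intuition behind U[·] in formula (6): the more concentrated (less uncertain) the hidden-layer data, the smaller its estimated entropy.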
S303. Superpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.

More specifically, the substitution entropy U[·] of the third continuous value attribute is calculated as follows:
Figure PCTCN2019072932-appb-000087
By constructing the above objective function, the
Figure PCTCN2019072932-appb-000088
-th hidden layer output matrix
Figure PCTCN2019072932-appb-000089
that minimizes the loss function is obtained. The training process of the neural network adopts the training mode of a traditional neural network, which is not repeated here.
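A hedged sketch of such a combined objective follows; the mean-squared error standing in for E[·] and the Gaussian-kernel entropy estimate standing in for U[·] are assumptions, since the patent's formula images are unavailable.

```python
import numpy as np

def enn_objective(actual, target, hidden, bandwidth=0.1):
    """Loss in the spirit of formula (6): reconstruction error of the
    network output plus the summed substitution entropy of each
    hidden-layer node's outputs (one sub-data set per node)."""
    error = np.mean((actual - target) ** 2)   # E[.]: assumed squared error
    entropy = 0.0
    for column in hidden.T:                   # superpose per-node entropies
        n = column.size
        diffs = (column[:, None] - column[None, :]) / bandwidth
        kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
        density = kernel.sum(axis=1) / (n * bandwidth)
        entropy += -np.mean(np.log(density))
    return error + entropy

hidden = np.array([[0.1, 0.5], [0.2, 0.6], [0.3, 0.7]])  # toy activations
base = enn_objective(np.zeros(3), np.zeros(3), hidden)
worse = enn_objective(np.ones(3), np.zeros(3), hidden)
print(worse - base)  # the error term adds exactly 1.0 here
```

Minimizing this sum trades output fidelity against the uncertainty of the selected hidden layer, which is the stated motivation for adding U[·] to the loss.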
S203. Use the neural network to train the second continuous value attribute, so as to take the data of the
Figure PCTCN2019072932-appb-000090
-th hidden layer as the third continuous value attribute.

S204. Merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute.

S205. Classify the fourth continuous value attribute to obtain classified data.
In the data classification method provided in this embodiment, when designing the neural network used to make discrete value attributes continuous, the uncertainty of the data set is introduced into the loss function; that is, in addition to ensuring that the error between the actual output and the true output is minimized, the uncertainty of the converted data set is also minimized. Experimental results show that, compared with the traditional one-hot encoding method, this deep encoding enables support vector machines and neural networks to obtain higher classification accuracy on mixed-attribute data sets.
Fig. 3 is a schematic structural diagram of a data classification device according to an exemplary embodiment of the present invention. As shown in Fig. 3, this embodiment provides a data classification device, including: an obtaining module 101, configured to perform continuous encoding processing on the discrete value attribute to obtain a second continuous value attribute, where the data includes the discrete value attribute and the first continuous value attribute; a module 102, configured to train the second continuous value attribute with a neural network and take the data of the
Figure PCTCN2019072932-appb-000091
-th hidden layer as a third continuous value attribute, where the neural network includes
Figure PCTCN2019072932-appb-000092
hidden layers; the obtaining module 101 is further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and the obtaining module 101 is further configured to classify the fourth continuous value attribute to obtain classified data.
Optionally, the device further includes: a construction module 103, configured to construct an objective function, where the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute; and a training module 104, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
Optionally, the construction module 103 is specifically configured to: subtract the theoretical value of the third continuous value attribute from the third continuous value attribute to obtain the error value; divide the third continuous value attribute into data sets to obtain first sub-data sets, where the first data set includes a plurality of first sub-data sets; obtain the substitution entropy of each first sub-data set; and, via a superposition module, superpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
Optionally, the construction module 103 is specifically configured to: obtain the substitution entropy of the first sub-data set according to a first formula, where the first formula is:
Figure PCTCN2019072932-appb-000093
denotes the first sub-data, En[·] denotes the substitution entropy,
Figure PCTCN2019072932-appb-000094
denotes the number of samples of the data,
Figure PCTCN2019072932-appb-000095
b q denotes the window width of the kernel density estimation method, and
Figure PCTCN2019072932-appb-000096
denote the n-th and m-th elements of the first sub-data, respectively.
Optionally, the construction module 103 is specifically configured to: obtain the substitution entropy of the third continuous value attribute according to a second formula, where the second formula is:
Figure PCTCN2019072932-appb-000097
where
Figure PCTCN2019072932-appb-000098
is the number of nodes contained in the
Figure PCTCN2019072932-appb-000099
-th hidden layer,
Figure PCTCN2019072932-appb-000100
is the third continuous value attribute, and
Figure PCTCN2019072932-appb-000101
In short, the data classification device provided by this application can be used to perform the above data classification method; for its content and effects, reference may be made to the method part, which is not repeated here.

Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention. As shown in Fig. 4, the electronic device 200 of this embodiment includes a processor 201 and a memory 202, where:
the memory 202 is configured to store computer-executable instructions;

the processor 201 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the receiving device in the above embodiments; for details, reference may be made to the related description in the foregoing method embodiments.

Optionally, the memory 202 may be independent or integrated with the processor 201.

When the memory 202 is provided independently, the flow control device 200 further includes a bus 203 for connecting the memory 202 and the processor 201.
An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data classification method described above.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

  1. A data classification method, characterized in that it comprises:
    performing continuous encoding processing on a discrete value attribute to obtain a second continuous value attribute, wherein data includes the discrete value attribute and a first continuous value attribute;
    training the second continuous value attribute with a neural network, and taking the data of the
    Figure PCTCN2019072932-appb-100001
    -th hidden layer as a third continuous value attribute, wherein the neural network comprises
    Figure PCTCN2019072932-appb-100002
    hidden layers;
    merging the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute;
    classifying the fourth continuous value attribute to obtain classified data.
  2. The method according to claim 1, characterized in that, before the training of the second continuous value attribute with the neural network to take the data of the
    Figure PCTCN2019072932-appb-100003
    -th hidden layer as the third continuous value attribute, the method further comprises:
    constructing an objective function, wherein the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute;
    training the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  3. The method according to claim 2, wherein the constructing the objective function specifically comprises:
    subtracting the theoretical value of the third continuous value attribute from the third continuous value attribute to obtain the error value;
    dividing the third continuous value attribute into data sets to obtain first sub-data sets, wherein the first data set comprises a plurality of first sub-data sets;
    obtaining the substitution entropy of the first sub-data set;
    superposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  4. The method according to claim 3, wherein obtaining the substitution entropy of the first sub-data set specifically comprises:
    obtaining the substitution entropy of the first sub-data set according to a first formula, where the first formula is published as Figure PCTCN2019072932-appb-100004, in which the leading symbol denotes the first sub-data set, En[·] denotes the substitution entropy, Figure PCTCN2019072932-appb-100005 denotes the number of samples in the data, b_q (Figure PCTCN2019072932-appb-100006) denotes the window width of the kernel density estimation method, and the symbols of Figure PCTCN2019072932-appb-100007 denote the n-th and m-th elements of the first sub-data set, respectively.
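The first formula itself is published only as an image, but the surrounding text identifies its ingredients: a kernel density estimate with window width b_q taken over all pairs of elements n, m of the sub-data set. A plausible estimator of that shape (the Gaussian kernel and the outer log-mean are assumptions, not the patent's exact formula) is:

```python
import numpy as np

def substitution_entropy(x, b):
    """KDE-based entropy estimate over all element pairs (n, m).

    The Gaussian kernel and the -mean(log density) form are assumed; the
    patent's first formula is available only as an image."""
    x = np.asarray(x, dtype=float)
    diffs = x[:, None] - x[None, :]                            # pairwise x_n - x_m
    kernel = np.exp(-0.5 * (diffs / b) ** 2) / (b * np.sqrt(2 * np.pi))
    density = kernel.mean(axis=1)                              # KDE at each sample
    return float(-np.log(density + 1e-12).mean())
```

As expected of an entropy estimate, a more widely spread sample yields a larger value than a tightly concentrated one.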
  5. The method according to claim 3, wherein superposing the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute specifically comprises:
    obtaining the substitution entropy of the third continuous value attribute according to a second formula, where the second formula is published as Figure PCTCN2019072932-appb-100008, in which Figure PCTCN2019072932-appb-100009 is the number of nodes in the Figure PCTCN2019072932-appb-100010-th hidden layer, Figure PCTCN2019072932-appb-100011 is the third continuous value attribute, and Figure PCTCN2019072932-appb-100012.
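Since the second formula is published only as an image, the superposition can only be sketched from its described ingredients. Reading it as a sum of per-node entropies over the nodes of the hidden layer is an assumption:

```python
import numpy as np

def layer_entropy(hidden, entropy_fn):
    # Superpose (sum) a per-node entropy over every node of the hidden layer;
    # treating the second formula as a sum over hidden-layer node outputs is
    # an assumed reading, since the formula itself is published only as an image.
    total = 0.0
    for q in range(hidden.shape[1]):   # one column of `hidden` per hidden node
        total += entropy_fn(hidden[:, q])
    return total
```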
  6. A data classification apparatus, comprising:
    an obtaining module, configured to perform continuous encoding on a discrete value attribute to obtain a second continuous value attribute, wherein the data comprise the discrete value attribute and a first continuous value attribute;
    a designating module, configured to train the second continuous value attribute with a neural network and to take the data of the Figure PCTCN2019072932-appb-100013-th hidden layer as a third continuous value attribute, wherein the neural network comprises Figure PCTCN2019072932-appb-100014 hidden layers;
    the obtaining module being further configured to merge the first continuous value attribute and the third continuous value attribute to obtain a fourth continuous value attribute; and
    the obtaining module being further configured to classify the fourth continuous value attribute to obtain classified data.
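The four operations of claim 6 form a pipeline that can be sketched end to end. Everything below the data setup is a placeholder (assumption): a fixed random projection stands in for the trained neural network's hidden layer, and a nearest-centroid rule stands in for the classification step, neither of which the claims prescribe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixed-attribute toy data: one discrete column, two continuous columns, binary labels.
discrete = rng.integers(0, 3, size=100)
continuous = rng.normal(size=(100, 2))
labels = (discrete == 2).astype(int)

# Obtaining module: continuous encoding of the discrete value attribute
# (a one-hot layout here, as the raw input the network re-embeds).
second_attr = np.eye(3)[discrete]

# Designating module: a fixed random hidden layer stands in for the trained
# network; its output plays the role of the third continuous value attribute.
w = rng.normal(size=(3, 4))
third_attr = np.tanh(second_attr @ w)

# Merge: concatenate with the first continuous value attribute -> fourth attribute.
fourth_attr = np.hstack([continuous, third_attr])

# Classify: nearest class centroid over the merged continuous representation.
centroids = {c: fourth_attr[labels == c].mean(axis=0) for c in (0, 1)}
pred = np.array([min(centroids, key=lambda c: np.linalg.norm(row - centroids[c]))
                 for row in fourth_attr])
accuracy = float((pred == labels).mean())
```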
  7. The apparatus according to claim 6, further comprising:
    a building module, configured to construct an objective function, wherein the objective function is the sum of the error value and the substitution entropy of the third continuous value attribute; and
    a training module, configured to train the neural network with the second continuous value attribute until the value of the objective function reaches its minimum.
  8. The apparatus according to claim 7, wherein the building module specifically comprises:
    a subtracting module, configured to subtract the theoretical value of the third continuous value attribute from the third continuous value attribute to obtain the error value;
    a dividing module, configured to divide the third continuous value attribute into data sets to obtain first sub-data sets, wherein a first data set comprises a plurality of the first sub-data sets;
    an obtaining module, configured to obtain the substitution entropy of each first sub-data set; and
    a superposing module, configured to superpose the substitution entropies of the plurality of first sub-data sets to obtain the substitution entropy of the third continuous value attribute.
  9. An electronic device, comprising at least one processor and a memory, wherein
    the memory stores computer-executable instructions; and
    the at least one processor executes the computer-executable instructions stored in the memory, so that the at least one processor performs the data classification method according to any one of claims 1 to 5.
  10. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the data classification method according to any one of claims 1 to 5.
PCT/CN2019/072932 2019-01-24 2019-01-24 Data classification method and apparatus, and device and storage medium WO2020150955A1 (en)


Publications (1)

Publication Number Publication Date
WO2020150955A1 true WO2020150955A1 (en) 2020-07-30

Family

ID=71736027


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0877132A (en) * 1994-08-31 1996-03-22 Victor Co Of Japan Ltd Learning method for cross coupling type neural network
CN105786860A (en) * 2014-12-23 2016-07-20 华为技术有限公司 Data processing method and device in data modeling
CN108362510A (en) * 2017-11-30 2018-08-03 中国航空综合技术研究所 A kind of engineering goods method of fault pattern recognition based on evidence neural network model
CN108628868A (en) * 2017-03-16 2018-10-09 北京京东尚科信息技术有限公司 File classification method and device


Non-Patent Citations (1)

Title
SUN, JINGUANG ET AL.: "DBN Classification Algorithm for Numerical Attribute", COMPUTER ENGINEERING AND APPLICATIONS, vol. 50, no. 2, 15 January 2014 (2014-01-15), pages 112 - 114, ISSN: 1002-8331 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19911376

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 15.09.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19911376

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07.04.2022)
