CN117473421A - Data classification method and device, storage medium and electronic equipment - Google Patents
- Publication number
- CN117473421A CN117473421A CN202311304121.0A CN202311304121A CN117473421A CN 117473421 A CN117473421 A CN 117473421A CN 202311304121 A CN202311304121 A CN 202311304121A CN 117473421 A CN117473421 A CN 117473421A
- Authority
- CN
- China
- Prior art keywords
- data
- sample
- training
- target field
- sample set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
Abstract
The disclosure provides a data classification method and device, an electronic device, and a computer-readable storage medium, and relates to the technical field of data processing. The method comprises the following steps: collecting target field data and preprocessing the data; converting the target field data into text data, and performing feature extraction on the text data through a pre-trained neural network to obtain a training data set; determining a time-varying weight of each sample in the training data set, and extracting a first sample set from the training data set based on the time-varying weights; for each decision tree, randomly selecting a corresponding feature subset according to a feature importance index to obtain a plurality of second sample sets; and training a random forest classifier through a random forest algorithm according to the first sample set and the second sample sets, so as to classify the target field data with the random forest classifier. The method and device effectively improve the prediction performance and accuracy of data classification.
Description
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data classification method, a data classification device, an electronic apparatus, and a computer readable storage medium.
Background
With the advent of the cloud era, big data has attracted more and more attention. Big data commonly describes the large volumes of unstructured and semi-structured data created by companies. This massive data contains rich information and value; by improving the capability to process it, the added value of the data can be realized.
Disclosure of Invention
An object of embodiments of the present disclosure is to provide a data classification method, a data classification apparatus, an electronic device, and a computer-readable storage medium, so that the capability to process big data can be improved to a certain extent and the added value of the data can be realized.
According to a first aspect of the present disclosure, there is provided a data classification method comprising: collecting target field data, and performing data preprocessing on the target field data; converting the target field data into text data, and performing feature extraction on the text data through a pre-trained neural network to obtain a training data set of the target field data; determining a time-varying weight of each sample in the training data set, and extracting a first sample set from the training data set based on the time-varying weights; for each decision tree, randomly selecting a corresponding feature subset according to a feature importance index to obtain a plurality of second sample sets; and training, according to the first sample set and the second sample sets, a random forest classifier for the target field data through a random forest algorithm, so as to classify the target field data with the random forest classifier.
In an exemplary embodiment of the disclosure, the performing data preprocessing on the target field data includes: performing missing value processing, abnormal value processing, and data transformation processing on the target field data.
In an exemplary embodiment of the present disclosure, the performing missing value processing on the target field data includes: calculating the average value and the mode of the target field data; filling missing values of numerical data in the target field data with the average value, and filling missing values of categorical variables in the target field data with the mode.
In an exemplary embodiment of the disclosure, the performing abnormal value processing on the target field data includes: identifying abnormal values in the target field data through a z-score standardization algorithm, and deleting the abnormal values or filling them using an interpolation method.
In an exemplary embodiment of the present disclosure, the performing data transformation processing on the target field data includes: converting the target field data to the same scale through a min-max normalization method.
In one exemplary embodiment of the present disclosure, the pre-trained neural network includes a convolutional neural network and a recurrent neural network; the performing feature extraction on the text data through the pre-trained neural network to obtain a training data set of the target field data includes the following steps: performing convolution operations on the text data using a plurality of convolution kernels in the convolutional neural network, and extracting target features through a pooling operation; capturing sequence information of the text data in the recurrent neural network using a long short-term memory network and a gated recurrent unit network; and performing feature fusion using a fully connected neural network to obtain the training data set of the target field data.
In an exemplary embodiment of the disclosure, the determining the time-varying weight of each sample in the training data set and extracting the first sample set from the training data set based on the time-varying weights includes: sorting the samples in the training data set by time, and setting the corresponding time-varying weight for each sample according to the time order; and sampling with replacement from the training data set according to the time-varying weights to obtain the first sample set.
In an exemplary embodiment of the present disclosure, the randomly selecting, for each decision tree, a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets includes: for each decision tree, calculating the feature importance index of each feature in the decision tree, and ranking the features in descending order of feature importance; and randomly selecting the second sample set corresponding to the decision tree according to the ranking.
In an exemplary embodiment of the present disclosure, the training according to the first sample set and the second sample set by a random forest algorithm to obtain the random forest classifier of the target area data includes: training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
According to a second aspect of the present disclosure, there is provided a data classification apparatus comprising: a data acquisition and preprocessing module configured to collect target field data and perform data preprocessing on the target field data; a feature extraction module configured to convert the target field data into text data and perform feature extraction on the text data through a pre-trained neural network to obtain a training data set of the target field data; a first sample set extraction module configured to determine a time-varying weight of each sample in the training data set and extract the first sample set from the training data set based on the time-varying weights; a second sample set selection module configured to randomly select, for each decision tree, a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets; and a data classification module configured to train, according to the first sample set and the second sample sets, a random forest classifier for the target field data through a random forest algorithm, so as to classify the target field data with the random forest classifier.
In an exemplary embodiment of the disclosure, the data acquisition and preprocessing module is specifically configured to: and carrying out missing value processing, abnormal value processing and data transformation processing on the target field data.
In an exemplary embodiment of the present disclosure, the first sample set extraction module is specifically configured to: sort the samples in the training data set by time, and set the corresponding time-varying weight for each sample according to the time order; and sample with replacement from the training data set according to the time-varying weights to obtain the first sample set.
In an exemplary embodiment of the present disclosure, the second sample set selection module is specifically configured to: for each decision tree, calculate the feature importance index of each feature in the decision tree, and rank the features in descending order of feature importance; and randomly select the second sample set corresponding to the decision tree according to the ranking.
In an exemplary embodiment of the disclosure, the data classification module is specifically configured to: training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of the above via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure may have some or all of the following advantages:
in the data classification method provided by the example embodiments of the present disclosure, target field data is collected and preprocessed; the target field data is converted into text data, and feature extraction is performed on the text data through a pre-trained neural network to obtain a training data set of the target field data; a time-varying weight of each sample in the training data set is determined, and a first sample set is extracted from the training data set based on the time-varying weights; for each decision tree, a corresponding feature subset is randomly selected according to the feature importance index to obtain a plurality of second sample sets; and a random forest classifier for the target field data is trained through a random forest algorithm according to the first sample set and the second sample sets, so that the target field data is classified by the random forest classifier. On the one hand, the present disclosure converts the target field data into text data and performs feature extraction on the text data through the pre-trained neural network, so that complex data patterns can be captured. On the other hand, time-varying weights and feature importance indexes are introduced into the random forest training process, which improves the prediction performance and accuracy of the model and reduces the risk of overfitting.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 schematically illustrates a flow diagram of a data classification method according to one embodiment of the disclosure;
FIG. 2 schematically illustrates a block diagram of a data classification apparatus according to one embodiment of the present disclosure;
fig. 3 schematically illustrates a schematic diagram of an electronic device according to one embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The present exemplary embodiment proposes a data classification method, a data classification apparatus, an electronic device, and a computer-readable storage medium. The following describes the technical scheme of the embodiments of the present disclosure in detail:
the present exemplary embodiment first provides a data classification method. Referring to fig. 1, the data classification method specifically includes the following steps:
step S110: collecting target field data, and carrying out data preprocessing on the target field data;
step S120: converting the target field data into text data, and extracting characteristics of the text data through a pre-trained neural network to obtain a training data set of the target field data;
Step S130: determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights;
step S140: for each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets;
step S150: and training according to the first sample set and the second sample set through a random forest algorithm to obtain a random forest classifier of the target field data, so as to realize data classification of the target field data through the random forest classifier.
In the data classification method provided by the exemplary embodiment of the present disclosure, on the one hand, the present disclosure converts target field data into text data and performs feature extraction on the text data through a pre-trained neural network, so that complex data patterns can be captured. On the other hand, time-varying weights and feature importance indexes are introduced into the random forest training process, which improves the prediction performance and accuracy of the model and reduces the risk of overfitting.
In another embodiment, the above steps are described in more detail below.
In step S110, target field data is collected and data preprocessing is performed on the target field data.
In embodiments of the present disclosure, the target field data may be collected from a plurality of different data sources. The target field may be, for example, the financial field. Because different data sources differ in data format, storage mode, data volume, and the like, an appropriate data acquisition method needs to be selected. Common data acquisition methods include API interfaces, crawler technology, database connections, and the like. In addition, preferably, the data may be preliminarily cleaned and processed during collection to remove abnormal data such as duplicate records and missing values, thereby ensuring data quality and accuracy.
In an embodiment, the data preprocessing on the target field data may include performing missing value processing, abnormal value processing, and data transformation processing on the target field data. Wherein:
illustratively, the above-described missing value processing may be implemented as follows: calculating the average value and the mode of the target field data; using the average value to fill missing values of numerical data in the target field data, and using the mode to fill missing values of categorical variables in the target field data. Specifically:
the processing of the numerical data missing values can be realized by the following formula:
Wherein x is i Representing the ith data sample, n represents the total number of samples, missing represents a missing value.
The categorical missing value processing can be realized by the following formula:
x_i = mode(x_1, …, x_n), if x_i is missing
where mode(x_1, …, x_n) denotes the most frequently occurring value in the data set, x_i denotes the i-th data sample, and missing denotes a missing value.
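The mean/mode filling described above can be sketched in plain Python (a non-authoritative sketch; `impute_missing` is a hypothetical helper name, and `None` stands in for a missing value):

```python
from statistics import mean, mode

def impute_missing(values, categorical=False):
    """Fill missing entries (None): mean for numeric data, mode for categorical variables."""
    present = [v for v in values if v is not None]
    fill = mode(present) if categorical else mean(present)
    return [fill if v is None else v for v in values]
```

For example, `impute_missing([1.0, None, 3.0])` fills the gap with the mean 2.0, while `impute_missing(["a", "a", None, "b"], categorical=True)` fills it with the mode "a".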
Illustratively, the above-described outlier processing may be implemented as follows: abnormal values in the target field data are identified through a z-score standardization algorithm, and the abnormal values are deleted or filled using an interpolation method. Here, outliers refer to data points that differ significantly from the other data. Taking financial big data as an example, outliers may be caused by data input errors, measurement errors, data processing errors, and the like. A specific implementation of identifying and processing outliers using the z-score standardization method in embodiments of the present disclosure may be as follows:
Let x_i denote the i-th data sample, let mean and std denote the mean and standard deviation of the data set, respectively, and let z_i denote the z-score of the i-th data sample. Then:
z_i = (x_i − mean) / std
According to the above formula, a data point with z_i > 3 or z_i < −3 is determined to be an outlier and is deleted or filled using an interpolation method.
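The z-score check can be sketched as follows (illustrative only; `zscore_outliers` is a hypothetical helper, and the population standard deviation is assumed):

```python
from statistics import mean, pstdev

def zscore_outliers(xs, threshold=3.0):
    """Return indices of samples whose |z-score| = |(x - mean)/std| exceeds the threshold."""
    m, s = mean(xs), pstdev(xs)  # population standard deviation
    if s == 0:
        return []  # constant data: no outliers by this criterion
    return [i for i, x in enumerate(xs) if abs((x - m) / s) > threshold]
```

For instance, with twenty samples equal to 10.0 and one equal to 1000.0, only the last index is flagged.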
Illustratively, the above data transformation processing may be implemented as follows: the target field data is converted to the same scale by a min-max normalization method. Specifically, taking financial big data analysis as an example, data usually has different measurement units or ranges, so data transformation is required. An implementation of the disclosed embodiments using a min-max normalization method to bring data onto the same scale may be as follows:
For a numerical variable x_i, normalize it to the range [0, 1]:
x'_i = (x_i − min) / (max − min)
where min and max denote the minimum and maximum values of the variable x_i in the data set, respectively, and x'_i denotes the normalized variable.
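A minimal sketch of this min-max step (hypothetical helper name; the degenerate constant-variable case is mapped to 0.0 by assumption):

```python
def min_max_scale(xs):
    """Map each value to [0, 1] via (x - min) / (max - min)."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]  # degenerate case: constant variable
    return [(x - lo) / (hi - lo) for x in xs]
```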
It should be noted that the above scenario is only an exemplary illustration, and the scope of the embodiments of the present disclosure is not limited thereto.
In step S120, the target field data is converted into text data, and feature extraction is performed on the text data through a neural network trained in advance, so as to obtain a training data set of the target field data.
In an embodiment of the present disclosure, the neural network may include a convolutional neural network and a recurrent neural network, and the feature extraction of the text data by the pre-trained neural network may be implemented as follows: performing convolution operations on the text data using a plurality of convolution kernels in the convolutional neural network, and extracting target features through a pooling operation; capturing sequence information of the text data in the recurrent neural network using a long short-term memory network and a gated recurrent unit network; and performing feature fusion using a fully connected neural network to obtain the training data set of the target field data.
In a specific embodiment, taking financial data as an example, the above feature extraction process may be implemented as follows:
s1: the financial data is converted to text format and each word is mapped into a high-dimensional vector space using a word vector model.
S2: in convolutional neural networks, multiple convolution kernels are used to convolve text and the most salient features are extracted by a pooling operation.
Specifically, the convolution operation of the CNN (convolutional neural network) is represented using the following formula:
H_i = f(W · X_{i:i+h−1} + b)
where X_{i:i+h−1} denotes a text segment composed of h consecutive words starting from the i-th word, W and b denote the convolution kernel and the bias vector, respectively, and f denotes a nonlinear activation function; the activation function employed in the present invention is the ReLU activation function.
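As a toy illustration of H_i = f(W · X_{i:i+h−1} + b) with max-over-time pooling — a single kernel sliding over 2-dimensional word vectors, in plain Python. This is a sketch for intuition, not the patent's actual implementation:

```python
def relu(x):
    return x if x > 0 else 0.0

def conv_max_pool(word_vectors, kernel, bias):
    """Slide a window of h word vectors, compute relu(W . X_{i:i+h-1} + b), then max-pool."""
    h = len(kernel)                       # window length in words
    feature_map = []
    for i in range(len(word_vectors) - h + 1):
        window = [v for vec in word_vectors[i:i + h] for v in vec]  # flatten X_{i:i+h-1}
        weights = [w for row in kernel for w in row]                # flatten W
        feature_map.append(relu(sum(w * x for w, x in zip(weights, window)) + bias))
    return max(feature_map)               # max-over-time pooling keeps the most salient feature
```

With word vectors [1,0], [0,1], [1,1] and an all-ones 2-word kernel, the two windows score 2.0 and 3.0, and pooling keeps 3.0.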
S3: in recurrent neural networks, long and short term memory networks (LSTM) and gated loop element networks (GRU) are used to capture sequence information.
Specifically, the forward propagation process of the RNN is represented using the following formula:
h_t = f(W_ih · x_t + b_ih + W_hh · h_{t−1} + b_hh)
where x_t denotes the word vector of the t-th word, h_{t−1} denotes the hidden-state vector of the previous time step, W_ih, W_hh, b_ih, and b_hh denote the input weight matrix, the hidden-state weight matrix, the input bias vector, and the hidden-state bias vector, respectively, and f denotes a nonlinear activation function; the activation function employed in the present invention is the tanh activation function.
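A single hidden-state update matching this formula can be sketched as follows (plain Python, illustrative dimensions; not the patent's implementation):

```python
import math

def rnn_step(x_t, h_prev, W_ih, b_ih, W_hh, b_hh):
    """One step of h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)."""
    def matvec(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]
    ih = matvec(W_ih, x_t)      # input contribution W_ih x_t
    hh = matvec(W_hh, h_prev)   # recurrent contribution W_hh h_{t-1}
    return [math.tanh(a + bi + b + bh)
            for a, bi, b, bh in zip(ih, b_ih, hh, b_hh)]
```

With a 1-dimensional state, identity input weight, and zero recurrence, the update reduces to h_t = tanh(x_t).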
S4: after text features are extracted, a fully connected neural network is used for further feature fusion. Specifically, the forward propagation process of a fully connected neural network is represented using the following formula:
y = f(W_2 · f(W_1 · h + b_1) + b_2)
where h denotes the text feature vector, W_1 and W_2 denote the weight matrices of the first and second layers, respectively, b_1 and b_2 denote the bias vectors of the first and second layers, respectively, and f denotes a nonlinear activation function; the activation function employed in the present invention is the ReLU activation function.
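A minimal two-layer forward pass matching this formula (ReLU assumed for both layers, as in the text; a sketch, not the patent's implementation):

```python
def relu(x):
    return x if x > 0 else 0.0

def fully_connected(h, W1, b1, W2, b2):
    """Two-layer forward pass: y = relu(W2 relu(W1 h + b1) + b2)."""
    def layer(W, b, v):
        return [relu(sum(w * x for w, x in zip(row, v)) + bb) for row, bb in zip(W, b)]
    return layer(W2, b2, layer(W1, b1, h))
```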
Preferably, the disclosed embodiments may also use an attention mechanism to enhance the model's attention to critical information, thereby improving the robustness and accuracy of the model. Specifically, this can be realized as follows:
the calculation of the attention mechanism is represented using the following formula:
wherein alpha is i Attention weight, e, representing the ith text segment i Attention score, h, representing the ith text segment i The feature vector of the ith text segment is represented, n represents the number of text segments, and c represents the feature vector weighted by attention.
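The softmax weighting and attention-weighted sum can be sketched as follows (assuming the attention scores e_i are already given; `attention_pool` is a hypothetical helper):

```python
import math

def attention_pool(scores, features):
    """Softmax the scores e_i into weights alpha_i, then form c = sum_i alpha_i * h_i."""
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]   # numerically stable softmax
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(features[0])
    c = [sum(a * h[d] for a, h in zip(alphas, features)) for d in range(dim)]
    return alphas, c
```

With equal scores, every segment receives weight 1/n and c is the plain average of the feature vectors.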
Taking the financial field as an example, the feature extraction algorithm of this step can effectively extract useful feature information from financial data for use in financial risk identification. In addition, the algorithm presented by the embodiments of the present disclosure combines natural language processing techniques with deep neural network models to automatically extract useful information from a large amount of financial data.
In step S130, time-varying weights of each sample data in the training data set are determined, and a first sample set is extracted from the training data set based on the time-varying weights.
In an embodiment of the present disclosure, the time-varying weights are adjusted according to the time positions of the samples in the training data set. Illustratively, the determining the time-varying weight of each sample in the training data set and extracting the first sample set based on the time-varying weights may be implemented as follows: sorting the samples in the training data set by time, and setting a corresponding time-varying weight for each sample according to the time order; and sampling with replacement from the training data set according to the time-varying weights to obtain the first sample set.
In a specific embodiment, the above procedure may be implemented as follows:
let the financial data set obtained after feature extraction be D = {(x_i, y_i)}_{i=1}^{N}, where N is the number of samples, x_i represents the feature vector of the i-th sample, and y_i represents its corresponding label. Assuming that the data set D is arranged in chronological order, in order to make the model pay more attention to recent samples, each sample may be assigned a time-varying weight w_i, calculated as follows:
where α is a decay parameter satisfying 0 < α < 1; a larger value of α indicates a higher degree of attention to recent samples.
From the data set D, sampling with replacement is performed according to the time-varying weights w_i to obtain a bootstrap sample set D_m containing N samples, i.e. the first sample set described above.
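The weighted bootstrap draw can be sketched as follows. The exact weight formula is not reproduced in the extracted text, so the geometric form below is an assumption, chosen so that a larger α concentrates more probability mass on recent samples, matching the stated behavior:

```python
import numpy as np

def time_weighted_bootstrap(X, y, alpha=0.9, rng=None):
    """Draw a bootstrap sample of size N, with replacement, where the
    probability of picking sample i grows with recency (i = N-1 is newest).
    Assumed weight form: w_i ∝ (1 - alpha) ** (N - 1 - i)."""
    if rng is None:
        rng = np.random.default_rng()
    N = len(X)
    w = (1.0 - alpha) ** (N - 1 - np.arange(N))  # newest sample gets weight 1
    p = w / w.sum()                              # normalize to probabilities
    idx = rng.choice(N, size=N, replace=True, p=p)
    return X[idx], y[idx]
```

The returned arrays have the same length N as the original data set, as in a standard bootstrap.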
It should be noted that the above scenario is only an exemplary illustration, and the scope of the embodiments of the present disclosure is not limited thereto.
In step S140, for each decision tree, a corresponding feature subset is randomly selected according to the feature importance index, so as to obtain a plurality of second sample sets.
In the embodiments of the present disclosure, the above-described feature importance index is used to represent the importance of each feature. For example, for each decision tree, randomly selecting the corresponding feature subset according to the feature importance index to obtain the plurality of second sample sets may be implemented as follows: for each decision tree, calculating the feature importance indexes of the sample data in the decision tree, and arranging the sample data in descending order according to the feature importance indexes; and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
In a specific embodiment, the above procedure may be implemented as follows:
feature selection is achieved using an adaptive feature-subset selection method: for each decision tree, a subset of features is randomly selected for training, where the selection probability of each feature is proportional to its importance.
Feature importance is calculated by the following formula:

I_f = (1 / |T_f|) · Σ_{t ∈ T_f} ΔI(t)

where I_f represents the importance of feature f, T_f represents the set of decision-tree nodes that split on feature f, ΔI(t) represents the information gain of node t, and |T_f| represents the number of elements in T_f.
The feature subset F_m, i.e. the second sample set described above, is selected as follows:
S1: Calculate the importance I_f of each feature.
S2: Arrange all features in descending order of importance.
S3: Randomly select a feature subset F_m of size p, with the selection probability of each feature proportional to its importance.
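Steps S1 to S3 above can be sketched as follows; the function name and the use of precomputed importances are illustrative assumptions:

```python
import numpy as np

def select_feature_subset(importances, p, rng=None):
    """Randomly pick p distinct feature indices, with selection probability
    proportional to the feature importances I_f (steps S1-S3)."""
    if rng is None:
        rng = np.random.default_rng()
    imp = np.asarray(importances, dtype=float)       # S1: importances I_f
    probs = imp / imp.sum()                          # S3: proportional probabilities
    chosen = rng.choice(len(imp), size=p, replace=False, p=probs)
    return chosen[np.argsort(-imp[chosen])]          # S2: report in descending order
```

The descending sort is not required by `numpy`'s weighted `choice`; it is kept only to mirror step S2.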
It should be noted that the above scenario is only an exemplary illustration, and the scope of the embodiments of the present disclosure is not limited thereto.
In step S150, a random forest classifier of the target field data is obtained through training by a random forest algorithm according to the first sample set and the second sample set, so as to realize data classification of the target field data by the random forest classifier.
In an embodiment of the present disclosure, after the first sample set and the second sample sets are obtained through the above steps, the process of training a random forest classifier of the target field data through a random forest algorithm according to the first sample set and the second sample sets may be implemented as follows: training based on the first sample set and each second sample set to obtain a plurality of decision tree classifiers, and combining the decision tree classifiers by a random forest algorithm to obtain the random forest classifier of the target field data.
In a specific embodiment, the training manner of the random forest algorithm based on time-varying weight and feature selection is as follows:
s1: from dataset D, according to time-varying weights w i Sampling with put back to obtain self-service sample set D containing N samples m 。
S2: the feature subset F is selected using the following method m :
S21: calculating the importance I of each feature f 。
S22: all features are arranged in descending order according to feature importance.
S23: randomly selecting a feature subset F of size p m The feature selection probability is proportional to the feature importance.
S24: using self-service sample set D m And feature subset F m Training a decision tree classifier h m 。
S25: and (3) outputting: random forest classifierWherein I (.cndot.) is an indicator function.
In the embodiment of the disclosure, after the random forest classifier is obtained through training, the data in the target field can be classified based on the random forest classifier. Taking the financial field as an example, the trained model can be applied to classify financial big data, so that analysis of the financial big data is further realized.
This step provides a random forest algorithm based on time-varying weights and feature selection for financial data classification. The algorithm improves predictive performance and reduces the risk of overfitting through dynamic weight allocation and feature-selection techniques: the time-varying weights are adjusted according to the time position of each sample in the training set, and the feature-selection technique reduces the dimensionality of the feature space through adaptive feature subsets.
It should be noted that although the steps of the methods in the present disclosure are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
Further, in this example embodiment, a data classification apparatus is provided, and referring to fig. 2, the data classification apparatus 200 may include a data acquisition and preprocessing module 210, a feature extraction module 220, a first sample set extraction module 230, a second sample set selection module 240, and a data classification module 250. Wherein:
the data acquisition and preprocessing module 210 may be configured to acquire target domain data and perform data preprocessing on the target domain data;
the feature extraction module 220 may be configured to convert the target field data into text data, and perform feature extraction on the text data through a neural network trained in advance to obtain a training dataset of the target field data;
The first sample set extraction module 230 may be configured to determine a time-varying weight of each sample data in the training data set, and extract the first sample set from the training data set based on the time-varying weight;
the second sample set selection module 240 may be configured to randomly select, for each decision tree, a corresponding feature subset according to a feature importance index, to obtain a plurality of second sample sets;
the data classification module 250 may be configured to obtain a random forest classifier of the target field data through training by a random forest algorithm according to the first sample set and the second sample set, so as to implement data classification of the target field data by the random forest classifier.
Details of the specific implementation of the data classification device are already described in detail at the corresponding positions of the data classification method, so that details are not repeated here.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Fig. 3 is a schematic structural diagram of an electronic device in an embodiment of the disclosure. Referring now in particular to fig. 3, a schematic diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 3 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processor, a graphics processor, etc.) 301 that may perform various suitable actions and processes to implement the data classification method of embodiments as described in the present disclosure, according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts, thereby implementing the data classification method as described above. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device 309, or installed from a storage device 308, or installed from a ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
collecting target field data, and carrying out data preprocessing on the target field data;
converting the target field data into text data, and extracting characteristics of the text data through a pre-trained neural network to obtain a training data set of the target field data;
determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights;
For each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets;
and training according to the first sample set and the second sample set through a random forest algorithm to obtain a random forest classifier of the target field data, so as to realize data classification of the target field data through the random forest classifier.
Alternatively, the electronic device may perform other steps described in the above embodiments when the above one or more programs are executed by the electronic device.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by substituting the features described above with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.
Claims (16)
1. A method of classifying data, comprising:
collecting target field data, and carrying out data preprocessing on the target field data;
converting the target field data into text data, and extracting features of the text data through a pre-trained neural network to obtain a training data set of the target field data;
determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights;
for each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets;
and training the random forest classifier of the target field data through a random forest algorithm according to the first sample set and the second sample set so as to realize data classification of the target field data through the random forest classifier.
2. The method of claim 1, wherein performing data preprocessing on the target field data comprises:
performing missing value processing, abnormal value processing and data transformation processing on the target field data.
3. The data classification method according to claim 2, wherein performing missing value processing on the target field data comprises:
calculating the average value and the mode of the target field data;
and filling missing values of numerical data in the target field data with the average value, and filling missing values of categorical variables in the target field data with the mode.
4. The data classification method according to claim 2, wherein performing abnormal value processing on the target field data comprises:
identifying abnormal values in the target field data through a z-score standardization algorithm, and deleting or filling the abnormal values by using an interpolation method.
5. The data classification method according to claim 2, wherein performing data transformation processing on the target field data comprises:
converting the target field data to the same scale by a min-max normalization method.
6. The data classification method of claim 1, wherein the pre-trained neural network comprises a convolutional neural network and a recurrent neural network;
the feature extraction is performed on the text data through a pre-trained neural network to obtain a training data set of the target field data, which comprises the following steps:
performing convolution operations on the text data by using a plurality of convolution kernels in the convolutional neural network, and extracting target features through pooling operations;
capturing sequence information of the text data in the recurrent neural network by using a long short-term memory network and a gated recurrent unit network;
and performing feature fusion by using a fully connected neural network to obtain a training data set of the target field data.
7. The method of data classification according to claim 1, wherein said determining time-varying weights for each sample data in the training dataset and extracting a first sample set from the training dataset based on the time-varying weights comprises:
sorting each sample data in the training data set according to time, and setting the corresponding time-varying weight for each sample data according to the time sequence;
and performing sampling with replacement in the training data set according to the time-varying weights to obtain the first sample set.
8. The method of claim 7, wherein for each decision tree, a corresponding feature subset is randomly selected according to a feature importance index to obtain a plurality of second sample sets, including:
for each decision tree, calculating the characteristic importance index of each sample data in the decision tree, and arranging the sample data in a descending order according to the characteristic importance index;
and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
9. The method of claim 8, wherein training through a random forest algorithm according to the first sample set and the second sample set to obtain the random forest classifier of the target field data comprises:
training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
10. A data sorting apparatus, comprising:
the data acquisition and preprocessing module is used for acquiring target field data and preprocessing the target field data;
the feature extraction module is used for converting the target field data into text data, and extracting features of the text data through a pre-trained neural network to obtain a training data set of the target field data;
the first sample set extraction module is used for determining time-varying weights of all sample data in the training data set and extracting the first sample set from the training data set based on the time-varying weights;
the second sample set selection module is used for randomly selecting a corresponding feature subset according to the feature importance index aiming at each decision tree to obtain a plurality of second sample sets;
the data classification module is used for training the random forest classifier for obtaining the target field data through a random forest algorithm according to the first sample set and the second sample set so as to realize data classification of the target field data through the random forest classifier.
11. The data classification device of claim 10, wherein the data acquisition and preprocessing module is specifically configured to:
performing missing value processing, abnormal value processing and data transformation processing on the target field data.
12. The data classification device of claim 10, wherein the first sample set extraction module is specifically configured to:
sorting each sample data in the training data set according to time, and setting the corresponding time-varying weight for each sample data according to the time sequence;
and performing sampling with replacement in the training data set according to the time-varying weights to obtain the first sample set.
13. The data classification device of claim 12, wherein the second sample set selection module is specifically configured to:
for each decision tree, calculating the characteristic importance index of each sample data in the decision tree, and arranging the sample data in a descending order according to the characteristic importance index;
and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
14. The data classification device of claim 13, wherein the data classification module is specifically configured to:
training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-9.
16. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-9 via execution of the executable instructions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311304121.0A CN117473421A (en) | 2023-10-09 | 2023-10-09 | Data classification method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117473421A true CN117473421A (en) | 2024-01-30 |
Family
ID=89624663
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |