CN117473421A - Data classification method and device, storage medium and electronic equipment - Google Patents

Data classification method and device, storage medium and electronic equipment Download PDF

Info

Publication number
CN117473421A
CN117473421A (application CN202311304121.0A)
Authority
CN
China
Prior art keywords
data
sample
training
target field
sample set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311304121.0A
Other languages
Chinese (zh)
Inventor
柯晨怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202311304121.0A priority Critical patent/CN117473421A/en
Publication of CN117473421A publication Critical patent/CN117473421A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a data classification method and device, electronic equipment and a computer readable storage medium, and relates to the technical field of data processing. The method comprises the following steps: collecting target field data and preprocessing the data; converting the target field data into text data, and extracting features of the text data through a pre-trained neural network to obtain a training data set; determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights; for each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets; and training according to the first sample set and the second sample set through a random forest algorithm to obtain a random forest classifier, so as to realize data classification of the target field data through the random forest classifier. The method and the device effectively improve the prediction performance and accuracy of data classification.

Description

Data classification method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data classification method, a data classification device, an electronic apparatus, and a computer readable storage medium.
Background
With the advent of the cloud era, big data has attracted more and more attention. Big data is commonly used to describe the large amounts of unstructured and semi-structured data created by a company; this massive data contains rich information and value, and improving the capacity to process it allows the added value of the data to be realized.
Disclosure of Invention
An object of an embodiment of the present disclosure is to provide a data classification method, a data classification device, an electronic apparatus, and a computer-readable storage medium, so that the capacity for processing big data can be improved to a certain extent and the value of the data can be realized.
According to a first aspect of the present disclosure, there is provided a data classification method comprising: collecting target field data, and carrying out data preprocessing on the target field data; converting the target field data into text data, and extracting features of the text data through a pre-trained neural network to obtain a training data set of the target field data; determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights; for each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets; and training the random forest classifier of the target field data through a random forest algorithm according to the first sample set and the second sample set so as to realize data classification of the target field data through the random forest classifier.
In an exemplary embodiment of the disclosure, the performing data preprocessing on the target field data includes: carrying out missing value processing, abnormal value processing and data transformation processing on the target field data.
In an exemplary embodiment of the present disclosure, the performing missing value processing on the target field data includes: calculating the average value and the mode of the target field data; and filling the missing values of the numerical data in the target field data by using the average value, and filling the missing values of the categorical variables in the target field data by using the mode.
In an exemplary embodiment of the disclosure, the performing outlier processing on the target field data includes: identifying abnormal values in the target field data through a z-score standardization algorithm, and deleting or filling the abnormal values by using an interpolation method.
In an exemplary embodiment of the present disclosure, the performing data transformation processing on the target field data includes: converting the target field data to the same scale by a min-max normalization method.
In one exemplary embodiment of the present disclosure, the pre-trained neural network includes a convolutional neural network and a recurrent neural network; the feature extraction is performed on the text data through a pre-trained neural network to obtain a training data set of the target field data, which comprises the following steps: performing convolution operations on the text data by using a plurality of convolution kernels in the convolutional neural network, and extracting target features through a pooling operation; capturing sequence information of the text data in the recurrent neural network by using a long short-term memory network and a gated recurrent unit network; and performing feature fusion by using a fully connected neural network to obtain the training data set of the target field data.
In an exemplary embodiment of the disclosure, the determining the time-varying weight of each sample data in the training data set and extracting the first sample set from the training data set based on the time-varying weight includes: sorting each sample data in the training data set according to time, and setting the corresponding time-varying weight for each sample data according to the time sequence; and performing sampling with replacement in the training data set according to the time-varying weight to obtain the first sample set.
In an exemplary embodiment of the present disclosure, for each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets includes: for each decision tree, calculating the feature importance index of each sample data in the decision tree, and arranging the sample data in descending order according to the feature importance index; and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
In an exemplary embodiment of the present disclosure, the training according to the first sample set and the second sample set by a random forest algorithm to obtain the random forest classifier of the target field data includes: training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
According to a second aspect of the present disclosure, there is provided a data classification apparatus comprising: the data acquisition and preprocessing module is used for acquiring target field data and preprocessing the target field data; the feature extraction module is used for converting the target field data into text data, and extracting features of the text data through a pre-trained neural network to obtain a training data set of the target field data; the first sample set extraction module is used for determining time-varying weights of all sample data in the training data set and extracting the first sample set from the training data set based on the time-varying weights; the second sample set selection module is used for randomly selecting, for each decision tree, a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets; the data classification module is used for obtaining the random forest classifier of the target field data through training by a random forest algorithm according to the first sample set and the second sample set, so as to realize data classification of the target field data through the random forest classifier.
In an exemplary embodiment of the disclosure, the data acquisition and preprocessing module is specifically configured to: and carrying out missing value processing, abnormal value processing and data transformation processing on the target field data.
In an exemplary embodiment of the present disclosure, the first sample set extraction module is specifically configured to: sorting each sample data in the training data set according to time, and setting the corresponding time-varying weight for each sample data according to the time sequence; and performing sampling with replacement in the training data set according to the time-varying weight to obtain the first sample set.
In an exemplary embodiment of the present disclosure, the second sample set selection module is specifically configured to: for each decision tree, calculating the feature importance index of each sample data in the decision tree, and arranging the sample data in descending order according to the feature importance index; and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
In an exemplary embodiment of the disclosure, the data classification module is specifically configured to: training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the method of any of the above via execution of the executable instructions.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the above.
Exemplary embodiments of the present disclosure may have some or all of the following advantages:
in the data classification method provided by the example embodiment of the present disclosure, target field data is collected and data preprocessing is performed on the target field data; the target field data is converted into text data, and feature extraction is performed on the text data through a pre-trained neural network to obtain a training data set of the target field data; time-varying weights of the sample data in the training data set are determined, and a first sample set is extracted from the training data set based on the time-varying weights; for each decision tree, a corresponding feature subset is randomly selected according to the feature importance index to obtain a plurality of second sample sets; and the random forest classifier of the target field data is obtained by training with a random forest algorithm according to the first sample set and the second sample set, so as to realize data classification of the target field data through the random forest classifier. On the one hand, the present disclosure converts the target field data into text data and performs feature extraction on the text data through the pre-trained neural network, so that complex data patterns can be captured. On the other hand, time-varying weights and a feature importance index are introduced in the process of constructing the random forest classifier, which improves the prediction performance and accuracy of the model and reduces the risk of overfitting.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 schematically illustrates a flow diagram of a data classification method according to one embodiment of the disclosure;
FIG. 2 schematically illustrates a block diagram of a data classification apparatus according to one embodiment of the present disclosure;
fig. 3 schematically illustrates a schematic diagram of an electronic device according to one embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. One skilled in the relevant art will recognize, however, that the aspects of the disclosure may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The present exemplary embodiment proposes a data classification method, a data classification apparatus, an electronic device, and a computer-readable storage medium. The following describes the technical scheme of the embodiments of the present disclosure in detail:
the present exemplary embodiment first provides a data classification method. Referring to fig. 1, the data classification method specifically includes the following steps:
step S110: collecting target field data, and carrying out data preprocessing on the target field data;
step S120: converting the target field data into text data, and extracting characteristics of the text data through a pre-trained neural network to obtain a training data set of the target field data;
Step S130: determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights;
step S140: for each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets;
step S150: and training according to the first sample set and the second sample set through a random forest algorithm to obtain a random forest classifier of the target field data, so as to realize data classification of the target field data through the random forest classifier.
In the data classification method provided by the exemplary embodiment of the present disclosure, on one hand, the present disclosure converts the target field data into text data and performs feature extraction on the text data through a pre-trained neural network, so that complex data patterns can be captured. On the other hand, time-varying weights and a feature importance index are introduced in the process of constructing the random forest classifier, which improves the prediction performance and accuracy of the model and reduces the risk of overfitting.
In another embodiment, the above steps are described in more detail below.
In step S110, target field data is collected and data preprocessing is performed on the target field data.
In embodiments of the present disclosure, the target field data may be collected from a plurality of different data sources. The target field may be, for example, the financial field. Since different data sources differ in data format, storage mode, data volume, and the like, it is necessary to select an appropriate data acquisition method. Common data collection methods include API interfaces, crawler technology, database connections, and the like. In addition, preferably, the data may be preliminarily cleaned and processed during collection, so that abnormal data such as duplicate records and missing values are removed, ensuring the quality and accuracy of the data.
In an embodiment, the data preprocessing on the target field data may include performing missing value processing, outlier processing, and data transformation processing on the target field data. Wherein:
Illustratively, the above missing value processing may be implemented as follows: calculating the average value and the mode of the target field data; the average value is used to fill in missing values of the numerical data in the target field data, and the mode is used to fill in missing values of the categorical variables in the target field data. Specifically:
Missing values of the numerical data can be filled according to the following formula:
x_i = (1/n) · Σ_{j=1}^{n} x_j, if x_i = missing
wherein x_i represents the i-th data sample, n represents the total number of samples, and missing represents a missing value.
Missing values of the categorical variables can be filled according to the following formula:
x_i = mode(x_1, …, x_n), if x_i = missing
wherein mode(x_1, …, x_n) represents the most frequently occurring value in the data set, x_i represents the i-th data sample, and missing represents a missing value.
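By way of a non-limiting illustration (not part of the original disclosure), the mean/mode filling described above could be sketched in Python with pandas roughly as follows; the column names are hypothetical.

```python
import pandas as pd

def fill_missing(df: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    """Fill missing numeric values with the column mean and missing
    categorical values with the column mode, as described above."""
    df = df.copy()
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].mean())       # mean imputation
    for col in categorical_cols:
        mode = df[col].mode(dropna=True)
        if not mode.empty:
            df[col] = df[col].fillna(mode.iloc[0])     # mode imputation
    return df

# Hypothetical financial columns
df = pd.DataFrame({"amount": [10.0, None, 30.0], "channel": ["app", None, "app"]})
print(fill_missing(df, numeric_cols=["amount"], categorical_cols=["channel"]))
```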
Illustratively, the above outlier processing may be implemented as follows: identifying abnormal values in the target field data through a z-score standardization algorithm, and deleting or filling the abnormal values by using an interpolation method. Outliers refer to data points that are significantly different from the other data. Taking financial big data as an example, outliers may be caused by data input errors, measurement errors, data processing errors, and the like. A specific implementation of embodiments of the present disclosure using the z-score normalization method to identify and process outliers may be as follows:
Let x_i represent the i-th data sample, mean and std represent the mean and standard deviation of the data set, respectively, and z_i represent the z-score of the i-th data sample; then:
z_i = (x_i - mean) / std
According to the above formula, a data point with z_i > 3 or z_i < -3 is determined to be an outlier, and is deleted or filled using an interpolation method.
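A minimal sketch of the z-score rule above, assuming linear interpolation is used to fill the flagged points; the 3-standard-deviation threshold follows the text, and the function name is illustrative.

```python
import pandas as pd

def handle_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Mark values whose |z-score| exceeds the threshold as outliers
    and fill them by linear interpolation, as described above."""
    z = (series - series.mean()) / series.std()
    cleaned = series.mask(z.abs() > threshold)          # outliers become NaN
    return cleaned.interpolate(limit_direction="both")  # fill by interpolation

# Hypothetical usage on a numeric column; the extreme value is flagged and interpolated
s = pd.Series([1.0, 1.2, 0.9, 1.1, 1.05, 0.95, 1.1, 1.0, 0.9, 1.2, 1.05, 0.98, 1.02, 1.1, 40.0])
print(handle_outliers(s))
```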
Illustratively, the above data transformation processing may be implemented as follows: the target field data is converted to the same scale by a min-max normalization method. Specifically, taking financial big data analysis as an example, data usually has different measurement units or ranges, so data transformation is required. An implementation of the disclosed embodiments using the min-max normalization method to bring the data onto the same scale may be as follows:
The numerical variable x_i is normalized to the range [0, 1]:
x'_i = (x_i - min) / (max - min)
wherein min and max respectively represent the minimum and maximum values of the variable x_i in the data set, and x'_i represents the normalized variable.
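A corresponding sketch of the min-max normalization, again as an assumed illustration rather than the disclosed implementation:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a numeric variable to [0, 1]: x' = (x - min) / (max - min)."""
    x_min, x_max = np.nanmin(x), np.nanmax(x)
    if x_max == x_min:                      # constant column: avoid division by zero
        return np.zeros_like(x, dtype=float)
    return (x - x_min) / (x_max - x_min)

print(min_max_normalize(np.array([10.0, 20.0, 40.0])))   # approximately [0, 0.33, 1]
```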
It should be noted that the above scenario is only an exemplary illustration, and the scope of the embodiments of the present disclosure is not limited thereto.
In step S120, the target field data is converted into text data, and feature extraction is performed on the text data through a neural network trained in advance, so as to obtain a training data set of the target field data.
In an embodiment of the present disclosure, the neural network may include a convolutional neural network and a recurrent neural network, and the feature extraction of the text data by using the pre-trained neural network may be implemented as follows: performing convolution operations on the text data by using a plurality of convolution kernels in the convolutional neural network, and extracting target features through a pooling operation; capturing sequence information of the text data in the recurrent neural network by using a long short-term memory network and a gated recurrent unit network; and performing feature fusion by using the fully connected neural network to obtain the training data set of the target field data.
In a specific embodiment, taking financial data as an example, the above feature extraction process may be implemented as follows:
s1: the financial data is converted to text format and each word is mapped into a high-dimensional vector space using a word vector model.
S2: in convolutional neural networks, multiple convolution kernels are used to convolve text and the most salient features are extracted by a pooling operation.
Specifically, the convolution operation of the CNN (convolutional neural network) is represented using the following formula:
H_i = f(W · X_{i:i+h-1} + b)
wherein X_{i:i+h-1} represents a text segment composed of h consecutive words starting from the i-th word, W and b represent the convolution kernel and the bias vector, respectively, and f represents a nonlinear activation function; the activation function employed in the present invention is the ReLU activation function.
S3: in recurrent neural networks, long and short term memory networks (LSTM) and gated loop element networks (GRU) are used to capture sequence information.
Specifically, the forward propagation process of the RNN is represented using the following formula:
h_t = f(W_ih · x_t + b_ih + W_hh · h_{t-1} + b_hh)
wherein x_t represents the word vector of the t-th word, h_{t-1} represents the hidden state vector of the previous time step, W_ih, W_hh, b_ih and b_hh represent the input weight matrix, the hidden state weight matrix, the input bias vector and the hidden state bias vector, respectively, and f represents a nonlinear activation function; the activation function employed in the present invention is the tanh activation function.
S4: after text features are extracted, a fully connected neural network is used for further feature fusion. Specifically, the forward propagation process of a fully connected neural network is represented using the following formula:
y=f(W 2 f(W 1 h+b 1 )+b 2 )
where h represents a text feature vector, W 1 And W is 2 A weight matrix representing the first layer and the second layer, b 1 And b 2 The bias vectors of the first layer and the second layer respectively, f represents a nonlinear activation function, and the activation function adopted by the invention is a ReLU activation function.
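As a hedged illustration of how the CNN/RNN/fully-connected pipeline of steps S2 to S4 could be assembled, the following PyTorch sketch combines multi-width convolution kernels with pooling, LSTM and GRU branches, and a fully connected fusion layer. The vocabulary size, embedding dimension, kernel widths, and layer sizes are illustrative assumptions and are not taken from the disclosure.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """CNN branch (multi-width kernels + max pooling) and RNN branch
    (LSTM + GRU) whose outputs are fused by a fully connected layer."""
    def __init__(self, vocab_size=10000, embed_dim=128, conv_channels=64,
                 kernel_sizes=(3, 4, 5), hidden_dim=64, out_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # S2: several convolution kernels followed by pooling (ReLU activation)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, conv_channels, k) for k in kernel_sizes)
        # S3: LSTM and GRU capture the sequence information
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # S4: fully connected fusion of the CNN and RNN features
        fused = conv_channels * len(kernel_sizes) + 2 * hidden_dim
        self.fc = nn.Sequential(nn.Linear(fused, out_dim), nn.ReLU())

    def forward(self, token_ids):                      # (batch, seq_len)
        e = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        c_in = e.transpose(1, 2)                       # (batch, embed_dim, seq_len)
        conv_feats = [torch.relu(conv(c_in)).max(dim=2).values for conv in self.convs]
        _, (h_lstm, _) = self.lstm(e)                  # final LSTM hidden state
        _, h_gru = self.gru(e)                         # final GRU hidden state
        rnn_feats = torch.cat([h_lstm[-1], h_gru[-1]], dim=1)
        return self.fc(torch.cat(conv_feats + [rnn_feats], dim=1))

# Hypothetical usage: a batch of 2 texts, each 20 token ids
features = TextFeatureExtractor()(torch.randint(0, 10000, (2, 20)))
```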
Preferably, the disclosed embodiments may also use an attention mechanism to enhance the degree of attention the model pays to critical information, thereby improving the robustness and accuracy of the model. Specifically, this can be realized as follows:
The calculation of the attention mechanism is represented using the following formulas:
α_i = exp(e_i) / Σ_{j=1}^{n} exp(e_j)
c = Σ_{i=1}^{n} α_i · h_i
wherein α_i represents the attention weight of the i-th text segment, e_i represents the attention score of the i-th text segment, h_i represents the feature vector of the i-th text segment, n represents the number of text segments, and c represents the attention-weighted feature vector.
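A small sketch of the attention weighting defined above (α_i obtained by a softmax over the scores e_i, and c = Σ α_i · h_i); the linear scoring layer used to produce e_i is an assumed choice.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Compute attention scores e_i for each text-segment vector h_i,
    normalize them with softmax into α_i, and return c = Σ α_i · h_i."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # assumed scoring function for e_i

    def forward(self, h):                     # h: (batch, n_segments, dim)
        e = self.score(h).squeeze(-1)         # (batch, n_segments)
        alpha = torch.softmax(e, dim=1)       # attention weights α_i
        return (alpha.unsqueeze(-1) * h).sum(dim=1)   # context vector c

c = AttentionPooling(dim=128)(torch.randn(2, 10, 128))
```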
Taking the financial field as an example, the feature extraction algorithm of the step can effectively extract useful feature information from financial data and be used for risk identification of the financial data. In addition, the algorithms presented by the embodiments of the present disclosure combine natural language processing techniques with deep neural network models to automatically extract useful information from a large amount of financial data.
In step S130, time-varying weights of each sample data in the training data set are determined, and a first sample set is extracted from the training data set based on the time-varying weights.
In an embodiment of the present disclosure, the time-varying weights are adjusted according to the time positions of the samples in the training dataset. Illustratively, the determining the time-varying weight of each sample data in the training data set and extracting the first sample set from the training data set based on the time-varying weight may be implemented as follows: sorting the sample data in the training data set according to time, and setting corresponding time-varying weights for the sample data according to the time sequence; and performing sampling with replacement in the training data set according to the time-varying weights to obtain the first sample set.
In a specific embodiment, the above procedure may be implemented as follows:
Let the financial data set obtained after feature extraction be D = {(x_i, y_i)}_{i=1}^{N}, where N is the number of samples, x_i represents the feature vector of the i-th sample, and y_i represents its corresponding label. Assuming that the samples in the data set D are arranged in time order, in order to make the model pay more attention to recent samples, each sample may be assigned a time-varying weight w_i that decays as the sample lies further in the past,
where α is an attenuation parameter satisfying 0 < α < 1, and a larger value of α indicates a higher degree of attention to recent samples.
From the data set D, sampling with replacement is performed according to the time-varying weights w_i to obtain a bootstrap sample set D_m containing N samples, i.e., the first sample set described above.
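The exact weight formula is not reproduced above; as a hedged assumption, the sketch below uses an exponential decay w_i ∝ (1 - α)^(N - i) for time-ordered samples (index N being the most recent), which is consistent with the stated property that a larger α places more emphasis on recent samples.

```python
import numpy as np

def time_varying_bootstrap(X, y, alpha=0.3, rng=None):
    """Draw a bootstrap sample set D_m of size N by sampling with replacement,
    with probabilities given by time-varying weights.

    The decay w_i ∝ (1 - alpha)**(N - i) is an assumed concrete choice:
    samples are ordered by time (last index = most recent), and a larger
    alpha places more emphasis on recent samples, as described above."""
    rng = np.random.default_rng(rng)
    N = len(X)
    w = (1.0 - alpha) ** (N - 1 - np.arange(N))   # oldest sample gets the smallest weight
    p = w / w.sum()
    idx = rng.choice(N, size=N, replace=True, p=p)
    return X[idx], y[idx]

# Hypothetical usage on a time-ordered feature matrix X and labels y
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
X_m, y_m = time_varying_bootstrap(X, y, alpha=0.3, rng=42)
```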
It should be noted that the above scenario is only an exemplary illustration, and the scope of the embodiments of the present disclosure is not limited thereto.
In step S140, for each decision tree, a corresponding feature subset is randomly selected according to the feature importance index, so as to obtain a plurality of second sample sets.
In the embodiments of the present disclosure, the above-described feature importance index is used to represent the importance of each feature. For example, for each decision tree, the randomly selecting the corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets may be implemented as follows: for each decision tree, calculating the feature importance index of each sample data in the decision tree, and arranging the sample data in descending order according to the feature importance index; and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
In a specific embodiment, the above procedure may be implemented as follows:
the feature selection is achieved using an adaptive feature subset selection method. For each decision tree, a subset of features is randomly selected for training. The probability of selection of a feature subset is proportional to the importance of the feature.
The feature importance is calculated by the following formula:
I_f = (1 / |T_f|) · Σ_{t ∈ T_f} Δi(t)
wherein I_f represents the importance of feature f, T_f represents the set of decision tree nodes that contain feature f, Δi(t) represents the information gain of node t, and |T_f| represents the number of elements in T_f.
The feature subset F_m, i.e., the second sample set, is selected as follows (a code sketch follows this list):
S1: calculate the importance I_f of each feature;
S2: arrange all the features in descending order according to their importance;
S3: randomly select a feature subset F_m of size p, where the probability of a feature being selected is proportional to its importance.
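A hedged sketch of S1 to S3: the importances I_f are assumed to come from the impurity-based importances of a preliminary random forest fit (one possible way to estimate the average information gain per feature), and p features are then drawn with probability proportional to importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_feature_subset(X, y, p, rng=None):
    """Select a feature subset F_m of size p, with the probability of a
    feature being selected proportional to its importance (S1-S3 above)."""
    rng = np.random.default_rng(rng)
    # S1: estimate the importance I_f of each feature; as an assumption,
    # use the impurity-based importances of a preliminary random forest
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
    importances = rf.feature_importances_
    # S2: descending order by importance (for inspection; sampling uses the raw values)
    order = np.argsort(importances)[::-1]
    # S3: draw p distinct features with probability proportional to importance
    prob = importances + 1e-12
    prob = prob / prob.sum()
    F_m = rng.choice(X.shape[1], size=p, replace=False, p=prob)
    return np.sort(F_m), order

# Hypothetical usage
X = np.random.rand(200, 10)
y = np.random.randint(0, 2, 200)
F_m, order = select_feature_subset(X, y, p=4, rng=0)
```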
It should be noted that the above scenario is only an exemplary illustration, and the scope of the embodiments of the present disclosure is not limited thereto.
In step S150, a random forest classifier of the target field data is obtained through training by a random forest algorithm according to the first sample set and the second sample set, so as to realize data classification of the target field data by the random forest classifier.
In an embodiment of the present disclosure, after the first sample set and the second sample set are obtained through the above steps, the process of training according to the first sample set and the second sample set by a random forest algorithm to obtain the random forest classifier of the target field data may be implemented as follows: training based on the first sample set and each second sample set to obtain a plurality of decision tree classifiers, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
In a specific embodiment, the training manner of the random forest algorithm based on time-varying weight and feature selection is as follows:
s1: from dataset D, according to time-varying weights w i Sampling with put back to obtain self-service sample set D containing N samples m
S2: the feature subset F is selected using the following method m
S21: calculating the importance I of each feature f
S22: all features are arranged in descending order according to feature importance.
S23: randomly selecting a feature subset F of size p m The feature selection probability is proportional to the feature importance.
S24: using self-service sample set D m And feature subset F m Training a decision tree classifier h m
S25: and (3) outputting: random forest classifierWherein I (.cndot.) is an indicator function.
In the embodiment of the disclosure, after the random forest classifier is obtained through training, the data in the target field can be classified based on the random forest classifier. Taking the financial field as an example, the trained model can be applied to classify financial big data, so that analysis of the financial big data is further realized.
This step provides a random forest algorithm based on time-varying weights and feature selection for financial data classification. The algorithm improves predictive performance and reduces the risk of overfitting through dynamic weight allocation and feature selection. The time-varying weights are adjusted according to the time position of each sample in the training set, and the feature selection technique reduces the dimensionality of the feature space through adaptive feature subsets.
It should be noted that although the steps of the methods in the present disclosure are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
Further, in this example embodiment, a data classification apparatus is provided, and referring to fig. 2, the data classification apparatus 200 may include a data acquisition and preprocessing module 210, a feature extraction module 220, a first sample set extraction module 230, a second sample set selection module 240, and a data classification module 250. Wherein:
the data acquisition and preprocessing module 210 may be configured to acquire target field data and perform data preprocessing on the target field data;
the feature extraction module 220 may be configured to convert the target field data into text data, and perform feature extraction on the text data through a neural network trained in advance to obtain a training dataset of the target field data;
The first sample set extraction module 230 may be configured to determine a time-varying weight of each sample data in the training data set, and extract the first sample set from the training data set based on the time-varying weight;
the second sample set selection module 240 may be configured to randomly select, for each decision tree, a corresponding feature subset according to a feature importance index, to obtain a plurality of second sample sets;
the data classification module 250 may be configured to obtain a random forest classifier of the target field data through training by a random forest algorithm according to the first sample set and the second sample set, so as to implement data classification of the target field data by the random forest classifier.
Details of the specific implementation of the data classification device are already described in detail at the corresponding positions of the data classification method, so that details are not repeated here.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Fig. 3 is a schematic structural diagram of an electronic device in an embodiment of the disclosure. Referring now in particular to fig. 3, a schematic diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 3 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processor, a graphics processor, etc.) 301 that may perform various suitable actions and processes to implement the data classification method of embodiments as described in the present disclosure, according to a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts, thereby implementing the data classification method as described above. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device 309, or installed from a storage device 308, or installed from a ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients, servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol ), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), the internet (e.g., the internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:
collecting target field data, and carrying out data preprocessing on the target field data;
converting the target field data into text data, and extracting characteristics of the text data through a pre-trained neural network to obtain a training data set of the target field data;
determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights;
For each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets;
and training according to the first sample set and the second sample set through a random forest algorithm to obtain a random forest classifier of the target field data, so as to realize data classification of the target field data through the random forest classifier.
Alternatively, the electronic device may perform other steps described in the above embodiments when the above one or more programs are executed by the electronic device.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by replacing the features described above with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (16)

1. A method of classifying data, comprising:
collecting target field data, and carrying out data preprocessing on the target field data;
converting the target field data into text data, and extracting features of the text data through a pre-trained neural network to obtain a training data set of the target field data;
determining time-varying weights of all sample data in the training data set, and extracting a first sample set from the training data set based on the time-varying weights;
for each decision tree, randomly selecting a corresponding feature subset according to the feature importance index to obtain a plurality of second sample sets;
and training the random forest classifier of the target field data through a random forest algorithm according to the first sample set and the second sample set so as to realize data classification of the target field data through the random forest classifier.
2. The method of claim 1, wherein the data preprocessing on the target field data comprises:
and carrying out missing value processing, abnormal value processing and data transformation processing on the target field data.
3. The data classification method according to claim 2, wherein the performing missing value processing on the target field data includes:
calculating the average value and the mode of the target field data;
and filling the missing values of the numerical data in the target field data by using the average value, and filling the missing values of the categorical variables in the target field data by using the mode.
4. The data classification method according to claim 2, wherein the performing outlier processing on the target field data includes:
and identifying abnormal values in the target field data through a z-score standardization algorithm, and deleting or filling the abnormal values by using an interpolation method.
5. The data classification method according to claim 2, wherein the performing data transformation processing on the target field data includes:
converting the target field data to the same scale by a min-max normalization method.
6. The data classification method of claim 1, wherein the pre-trained neural network comprises a convolutional neural network and a recurrent neural network;
the feature extraction is performed on the text data through a pre-trained neural network to obtain a training data set of the target field data, which comprises the following steps:
performing convolution operations on the text data by using a plurality of convolution kernels in the convolutional neural network, and extracting target features through a pooling operation;
capturing sequence information of the text data in the recurrent neural network by using a long short-term memory network and a gated recurrent unit network;
and performing feature fusion by using a fully connected neural network to obtain a training data set of the target field data.
7. The method of data classification according to claim 1, wherein said determining time-varying weights for each sample data in the training dataset and extracting a first sample set from the training dataset based on the time-varying weights comprises:
sorting each sample data in the training data set according to time, and setting the corresponding time-varying weight for each sample data according to the time sequence;
And performing sampling with replacement in the training data set according to the time-varying weight to obtain the first sample set.
8. The method of claim 7, wherein for each decision tree, a corresponding feature subset is randomly selected according to a feature importance index to obtain a plurality of second sample sets, including:
for each decision tree, calculating the feature importance index of each sample data in the decision tree, and arranging the sample data in descending order according to the feature importance index;
and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
9. The method of claim 8, wherein the training according to the first sample set and the second sample set by a random forest algorithm to obtain the random forest classifier of the target field data comprises:
training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
10. A data classification apparatus, comprising:
the data acquisition and preprocessing module is used for acquiring target field data and preprocessing the target field data;
the feature extraction module is used for converting the target field data into text data, and extracting features of the text data through a pre-trained neural network to obtain a training data set of the target field data;
the first sample set extraction module is used for determining time-varying weights of all sample data in the training data set and extracting the first sample set from the training data set based on the time-varying weights;
the second sample set selection module is used for randomly selecting a corresponding feature subset according to the feature importance index aiming at each decision tree to obtain a plurality of second sample sets;
the data classification module is used for obtaining the random forest classifier of the target field data through training by a random forest algorithm according to the first sample set and the second sample set, so as to realize data classification of the target field data through the random forest classifier.
11. The data classification device of claim 10, wherein the data acquisition and preprocessing module is specifically configured to:
And carrying out missing value processing, abnormal value processing and data transformation processing on the target field data.
12. The data classification device of claim 10, wherein the first sample set extraction module is specifically configured to:
sorting each sample data in the training data set according to time, and setting the corresponding time-varying weight for each sample data according to the time sequence;
and performing sampling with replacement in the training data set according to the time-varying weight to obtain the first sample set.
13. The data classification device of claim 12, wherein the second sample set selection module is specifically configured to:
for each decision tree, calculating the feature importance index of each sample data in the decision tree, and arranging the sample data in descending order according to the feature importance index;
and randomly selecting the second sample set corresponding to the decision tree according to the sorting order of the sample data.
14. The data classification device of claim 13, wherein the data classification module is specifically configured to:
training to obtain a plurality of decision tree classifiers based on the first sample set and each second sample set, and training the decision tree classifiers by adopting a random forest algorithm to obtain the random forest classifier of the target field data.
15. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1-9.
16. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1-9 via execution of the executable instructions.
CN202311304121.0A, filed 2023-10-09: Data classification method and device, storage medium and electronic equipment. Publication: CN117473421A (pending)

Priority Applications (1)

Application Number: CN202311304121.0A
Priority Date / Filing Date: 2023-10-09
Publication: CN117473421A (en)
Title: Data classification method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117473421A (en), published 2024-01-30

Family

ID=89624663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311304121.0A Pending CN117473421A (en) 2023-10-09 2023-10-09 Data classification method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117473421A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination