CN114358153A - Data classification method and device - Google Patents

Data classification method and device

Info

Publication number
CN114358153A
CN114358153A
Authority
CN
China
Prior art keywords
data
evaluation value
category
type
candidate
Prior art date
Legal status
Pending
Application number
CN202111575925.5A
Other languages
Chinese (zh)
Inventor
王文举
陈立力
周明伟
Current Assignee
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202111575925.5A priority Critical patent/CN114358153A/en
Publication of CN114358153A publication Critical patent/CN114358153A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data classification method and device for improving the accuracy of data classification. The method comprises the following steps: acquiring text-type data and numerical-type data included in data to be classified; analyzing the text-type data by adopting a first preset algorithm to obtain a first evaluation value corresponding to each candidate data category in at least one candidate data category, wherein the first evaluation value is determined based on the probability that the text-type data belongs to the corresponding candidate data category; analyzing the numerical-type data by adopting a second preset algorithm to obtain a second evaluation value corresponding to each candidate data category, wherein the second evaluation value is determined based on the probability that the numerical-type data belongs to the corresponding candidate data category; and determining a target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category, wherein the target data category includes a candidate data category of the at least one candidate data category.

Description

Data classification method and device
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data classification method and apparatus.
Background
Currently, data stored in a database mainly falls into two categories: text-type data and numerical-type data. Text-type data refers to data composed of characters; for example, "jurisdiction", "host unit", and the like stored in the database are text-type data. Numerical-type data refers to data made up of numbers, symbols, and the like; for example, data stored in the database such as "@@330603" is numerical-type data. For data classification, the related art introduces that, for text-type data, a named entity recognition (NER) model is generally adopted to extract the named entities in the text-type data and determine the data category. For numerical-type data, a machine learning model such as LightGBM can generally be used for classification.
The above data classification methods are effective for data containing a single category of content. However, if the data to be classified includes both text-type data and numerical-type data, either of the above methods may classify it inaccurately.
Disclosure of Invention
The exemplary embodiments of the present application provide a data classification method and apparatus, so as to improve the accuracy of data classification.
In a first aspect, an embodiment of the present application provides a data classification method, including:
acquiring text type data and numerical type data included in data to be classified;
analyzing the text-type data by adopting a first preset algorithm to obtain a first evaluation value corresponding to each candidate data category in at least one candidate data category; wherein the first evaluation value is determined based on a probability that the text-type data belongs to the corresponding candidate data category;
analyzing the numerical-type data by adopting a second preset algorithm to obtain a second evaluation value corresponding to each candidate data category; the second evaluation value is determined based on a probability that the numerical-type data belongs to the corresponding candidate data category;
determining a target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category; the target data category includes a candidate data category of the at least one candidate data category.
In the related art, data that includes both text-type data and numerical-type data is generally classified with a single algorithm or model, which yields inaccurate results. The present application splits such data, classifies the text-type data and the numerical-type data with different algorithms, and determines the data category of the data to be classified by combining the categories determined for the two kinds of data, thereby improving the accuracy of data classification.
In some embodiments, the determining, according to the first evaluation value and the second evaluation value corresponding to each candidate data category, the target data category corresponding to the data to be classified includes:
determining a comprehensive evaluation value corresponding to each candidate data category according to the first evaluation value and the second evaluation value corresponding to that candidate data category;
and determining, as the target data category, the candidate data category corresponding to the maximum value among the determined comprehensive evaluation values.
In some embodiments, the determining a comprehensive evaluation value corresponding to each candidate data category according to the first evaluation value and the second evaluation value corresponding to each candidate data category includes:
for any candidate data category in the at least one candidate data category, calculating the first evaluation value and the second evaluation value corresponding to that candidate data category by adopting a logistic regression algorithm to obtain the comprehensive evaluation value corresponding to that candidate data category.
In some embodiments, the data to be classified is data in a table format; the acquiring text type data and numerical type data included in the data to be classified includes:
determining the text type data and the numerical type data in the data to be classified according to the name field included in the header of the data to be classified;
and acquiring the text type data and the numerical type data.
In some embodiments, the analyzing the text-type data by using a first preset algorithm to obtain a first evaluation value corresponding to each of at least one candidate data category includes:
converting the text-type data into at least one word vector;
and inputting the at least one word vector into a pre-trained convolutional neural network model to obtain a first evaluation value corresponding to each candidate data category in the at least one candidate data category.
Determining the data category of the text-type data by way of word-vector conversion improves the classification accuracy for text-type data.
In some embodiments, the method further comprises:
and based on a pre-configured abnormal database, eliminating abnormal data included in the text type data and the numerical type data.
In a second aspect, an embodiment of the present application provides a data classification apparatus, including:
the acquiring unit is used for acquiring text type data and numerical type data included in the data to be classified;
a processing unit configured to perform:
analyzing the text-type data by adopting a first preset algorithm to obtain a first evaluation value corresponding to each candidate data category in at least one candidate data category; wherein the first evaluation value is determined based on a probability that the text-type data belongs to the corresponding candidate data category;
analyzing the numerical-type data by adopting a second preset algorithm to obtain a second evaluation value corresponding to each candidate data category; the second evaluation value is determined based on a probability that the numerical-type data belongs to the corresponding candidate data category;
determining a target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category; the target data category includes a candidate data category of the at least one candidate data category.
In some embodiments, the processing unit is specifically configured to:
determining a comprehensive evaluation value corresponding to each candidate data category according to the first evaluation value and the second evaluation value corresponding to that candidate data category;
and determining, as the target data category, the candidate data category corresponding to the maximum value among the determined comprehensive evaluation values.
In some embodiments, the processing unit is specifically configured to:
for any candidate data category in the at least one candidate data category, calculating the first evaluation value and the second evaluation value corresponding to that candidate data category by adopting a logistic regression algorithm to obtain the comprehensive evaluation value corresponding to that candidate data category.
In some embodiments, the data to be classified is data in a table format; the processing unit is further configured to:
determining the text type data and the numerical type data in the data to be classified according to the name field included in the header of the data to be classified;
instructing the obtaining unit to obtain the text-type data and the numerical-type data.
In some embodiments, the processing unit is specifically configured to:
converting the text-type data into at least one word vector;
and inputting the at least one word vector into a pre-trained convolutional neural network model to obtain a first evaluation value corresponding to each candidate data category in the at least one candidate data category.
In some embodiments, the processing unit is further configured to:
and based on a pre-configured abnormal database, eliminating abnormal data included in the text type data and the numerical type data.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a controller and a memory. The memory is used for storing computer-executable instructions, and the controller executes the computer-executable instructions in the memory to perform the operational steps of any one of the possible implementations of the method according to the first aspect by using hardware resources in the controller.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In addition, for the beneficial effects of the second to fourth aspects, reference may be made to the beneficial effects of the first aspect, which are not repeated here.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments are briefly introduced below; it will be apparent that the drawings in the following description illustrate only some embodiments of the present application.
Fig. 1 is a flowchart of a data classification method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a TextCNN model provided in an embodiment of the present application;
FIG. 3 is a graph of a feature transfer function provided in an embodiment of the present application;
FIG. 4 is a line graph for characterizing the degree of probability association between a feature and a category according to an embodiment of the present application;
FIG. 5 is a histogram of a training sample distribution provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a confusion matrix according to an embodiment of the present application;
FIG. 7 is a model evaluation report provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of a data classification apparatus according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the technical solutions of the present application, the acquisition, storage, use, and processing of data comply with the relevant provisions of national laws and regulations.
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments, but not all embodiments, of the technical solutions of the present application. All other embodiments obtained by a person skilled in the art without any inventive step based on the embodiments described in the present application are within the scope of the protection of the present application.
The terms "first" and "second" in the description and claims of the present application and the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the term "comprises" and any variations thereof, which are intended to cover non-exclusive protection. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. The "plurality" in the present application may mean at least two, for example, two, three or more, and the embodiments of the present application are not limited.
In addition, the term "and/or" herein is only one kind of association relationship describing an associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in this document generally indicates that the preceding and following related objects are in an "or" relationship unless otherwise specified.
In the related art, in order to ensure the accuracy of data classification, different classification methods are generally adopted for data in different formats. For example, text-type data is preferably classified using an NER model, and numerical-type data is preferably classified using a pre-trained LightGBM model. However, for data to be classified that includes both text-type data and numerical-type data, neither of the above classification methods alone can ensure accurate classification. In view of this, an embodiment of the present application provides a data classification method in which the text-type data and the numerical-type data in the data to be classified are separated, different analysis methods are respectively adopted to determine the data category of the text-type data and the data category of the numerical-type data, and the data category of the data to be classified is determined from these two results, so as to improve the accuracy of data classification.
Optionally, the scheme provided by the application can be applied to scenarios of classified data storage. For example, before data is stored in a database, its data category may be determined with the scheme provided in the present application, and the data may then be stored by category; that is, data of one data category may be stored in the same storage space for subsequent use by a user.
First, to facilitate understanding of the scheme proposed in the present application, refer to fig. 1, a flowchart of a data classification method provided in an embodiment of the present application. The execution subject of the data classification method in the embodiments of the present application is not particularly limited; for example, the method may be executed by a terminal such as a computer or a mobile phone, or by an electronic device such as a server or a chip. The method flow shown in fig. 1 specifically includes:
101, acquiring text type data and numerical type data included in the data to be classified.
Optionally, the data to be classified may be split to obtain text type data and numerical type data.
And 102, analyzing the text-type data by adopting a first preset algorithm to obtain a first evaluation value corresponding to each candidate data category.
Wherein the first evaluation value is determined based on the probability that the text-type data belongs to the corresponding candidate data category. For example, suppose the candidate data categories include data category A, data category B, and data category C, and that the probability that the text-type data belongs to data category A is 0.1, the probability that it belongs to data category B is 0.6, and the probability that it belongs to data category C is 0.9. Then the first evaluation value for data category A is 0.1, the first evaluation value for data category B is 0.6, and the first evaluation value for data category C is 0.9.
Alternatively, the candidate data categories may include code, number, name, indicator, description, date, datetime, amount, or unknown. The probability that the text-type data belongs to each candidate data category can be determined with the first preset algorithm, and the first evaluation value corresponding to each candidate data category is determined based on that probability.
And 103, analyzing the numerical-type data by adopting a second preset algorithm to obtain a second evaluation value corresponding to each candidate data category.
Wherein the second evaluation value is determined based on a probability that numerical data belongs to a corresponding candidate data category.
And 104, determining the target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category.
As an alternative, a comprehensive evaluation value corresponding to each candidate data category may be determined based on the first evaluation value and the second evaluation value. The comprehensive evaluation value can be used to characterize the probability that the data to be classified belongs to each candidate data category. Further, the candidate data category corresponding to the maximum value among the determined comprehensive evaluation values may be taken as the target data category.
Alternatively, the first evaluation value and the second evaluation value may be combined with a set algorithm to obtain the comprehensive evaluation value. As an example, assume the probability that the text-type data belongs to data category A is 0.1, that is, the first evaluation value corresponding to data category A is 0.1, and the probability that the numerical-type data belongs to data category A is 0.3, that is, the second evaluation value corresponding to data category A is 0.3. The set algorithm may then be applied to 0.1 and 0.3, and the result used as the comprehensive evaluation value corresponding to data category A, that is, the probability that the data to be classified belongs to data category A. The same method may be employed to calculate the comprehensive evaluation value corresponding to every candidate data category, and the candidate data category corresponding to the maximum value among the comprehensive evaluation values may be taken as the target data category.
Based on this scheme, the present application separates the text-type data and numerical-type data in the data, analyzes the two kinds of data with different algorithms, and determines the data category of the data to be classified from the analysis results, thereby solving the problem that a single algorithm is unsuitable for data containing two formats at once and improving the accuracy of data classification.
In some scenarios, the data to be classified may be data in table form; for example, in table 1 below, each row of data except the header may be regarded as one piece of data to be classified. Optionally, when the data to be classified is split into text-type data and numerical-type data, the two may be distinguished according to the name fields included in the header. For example, when a header field such as "name" or "text data" is recognized, the column of data under that field may be determined to be text-type data. When a header field such as "content data information", "data", or "content information" is recognized, the column of data under that field may be determined to be numerical-type data.
TABLE 1
[Table 1 is reproduced as an image in the original publication. It pairs a name column containing text-type data (e.g., "host unit") with a content column containing numerical-type data, in which abnormal values such as "@" appear.]
Optionally, after the text-type data and the numerical-type data included in the data to be classified are acquired, a data cleaning step may be performed to eliminate the abnormal data included in them (abnormal data may also be understood as abnormal or invalid values). For example, the abnormal data may include "none", "@", "-1", spaces, and the like. As an alternative, an abnormal database may be configured in advance, and the abnormal data in the text-type data and the numerical-type data eliminated based on it: the abnormal data in the database are matched against the text-type data and the numerical-type data, and any matched content is deleted. For example, if the abnormal database includes the abnormal datum "@" and the numerical-type data in table 1 also contains "@", the match succeeds and the "@" in the numerical-type data may be deleted.
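For illustration only (the embodiment does not prescribe a concrete implementation), the header-based split and the database-driven cleaning might be sketched in Python as follows; the column-name keywords and the contents of the abnormal-value database are assumptions:

```python
import pandas as pd

# Hypothetical abnormal-value database; the description only names examples
# such as "none", "-1", spaces, and the "@" character.
ABNORMAL_CELLS = {"none", "-1", ""}
ABNORMAL_CHARS = ["@"]

def split_and_clean(df: pd.DataFrame):
    """Split a table into text-type and numerical-type cells by header name,
    then eliminate abnormal data based on the pre-configured database."""
    # Header-based routing as described above; the keywords are illustrative.
    text_cols = [c for c in df.columns if "name" in c or "text" in c]
    num_cols = [c for c in df.columns if "content" in c or "data" in c]

    def clean(series: pd.Series) -> pd.Series:
        series = series.astype(str).str.strip()
        series = series[~series.isin(ABNORMAL_CELLS)]   # drop fully abnormal cells
        for ch in ABNORMAL_CHARS:                       # e.g. "@@330603" -> "330603"
            series = series.str.replace(ch, "", regex=False)
        return series

    return clean(df[text_cols].stack()), clean(df[num_cols].stack())
```

The sketch both drops cells that match the database exactly and strips abnormal characters inside cells, mirroring the "@" example above.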
In some embodiments, after the text-type data and the numerical-type data are acquired, the two kinds of data may be analyzed with the algorithms preset for them, and the first evaluation value and the second evaluation value corresponding to each candidate data category determined. First, the process of analyzing the text-type data to obtain the first evaluation values corresponding to the candidate data categories is described. It should be noted that the following analysis method for the text-type data is only an example, and the model, algorithm, and processing sequence adopted in analyzing text-type data are not specifically limited in the present application.
In one possible implementation, the text-type data may first be subjected to word segmentation, that is, split at the granularity of words or phrases. Taking the text-type data "host unit" shown in table 1 above as an example, word segmentation yields "host" and "unit". As one option, the words or phrases obtained from segmentation may be input directly into a pre-trained model to obtain the word vector corresponding to each one. As another option, the words or phrases may first be encoded into initial word vectors, which are then input into a pre-trained model to be refined into the final word vector for each word or phrase.
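For instance, such a segmentation-and-lookup step could be sketched as follows; the use of the jieba segmenter and the in-memory embedding table are assumptions, since the embodiment names neither:

```python
import jieba  # a common Chinese word segmenter; its use here is an assumption

def to_word_vectors(text: str, embedding: dict) -> list:
    """Segment text-type data into words/phrases and map each one to a word
    vector via a pre-trained embedding table (a plain dict in this sketch)."""
    words = jieba.lcut(text)  # e.g. "主办单位" (host unit) -> ["主办", "单位"]
    dim = len(next(iter(embedding.values())))
    # Unknown words fall back to a zero vector; a trained model would instead
    # refine these initial vectors, as described above.
    return [embedding.get(w, [0.0] * dim) for w in words]
```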
Optionally, after the word vectors corresponding to the text-type data are obtained, they may be input into a pre-trained convolutional neural network to obtain the probability that the text-type data belongs to each candidate data category, that is, the first evaluation value corresponding to each candidate data category. Optionally, the convolutional neural network used to predict the first evaluation value may be a TextCNN model. As an example, the TextCNN model proposed in the present application may include an input layer, an embedding layer, a convolutional layer, a max pooling layer, a fully connected layer, and an output layer; its structure may be as shown in fig. 2. The functions of the network layers in the TextCNN model are briefly described below:
Embedding layer: adjusts the parameters of each word vector input into the TextCNN model. Optionally, referring to fig. 2, the embedding layer may convert the input word vectors into a matrix.
Convolutional layer: used for feature extraction from the matrix output by the embedding layer; different convolution kernels extract different features, and inputs of different candidate data categories may activate different convolution kernels, yielding different features. Optionally, the size of the convolution kernels, the number of convolutional layers, and the number of kernels may be set in advance. Since a large number of convolutional layers may cause gradient explosion or gradient vanishing, the present application proposes setting the number of convolutional layers to one.
Max pooling layer: used for splicing the features output by the convolutional layer. Specifically, the max pooling layer takes the maximum of each feature vector output by the convolutional layer and concatenates the results as its output value.
Fully connected layer: splices the feature vectors output by the max pooling layer once more to obtain the output result.
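A minimal PyTorch sketch of such a TextCNN is given below for illustration. The vocabulary size, embedding width, kernel widths, and filter count are assumptions; the layer sequence (embedding, a single convolutional layer, max pooling, full connection) follows the description above, and the nine output classes match the candidate categories listed earlier:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128,
                 kernel_sizes=(2, 3, 4), num_filters=64, num_classes=9):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # One convolutional layer, as proposed above, with several kernel
        # widths so that different n-gram features are extracted.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):            # (batch, seq_len) of word indices
        x = self.embedding(token_ids)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                # Conv1d expects (batch, channels, seq)
        # Max-pool each convolution output over time, then splice (concatenate).
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        logits = self.fc(torch.cat(pooled, dim=1))
        return torch.softmax(logits, dim=1)  # first evaluation values per category

# Usage: probabilities for a batch of two 10-token sequences.
model = TextCNN()
print(model(torch.randint(0, 5000, (2, 10))).shape)  # torch.Size([2, 9])
```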
Through the above structure, the TextCNN model can obtain the first evaluation value corresponding to each of the at least one candidate data category, that is, the probability that the text-type data belongs to each candidate data category. The process of analyzing the text-type data to obtain the first evaluation values has been described above. Next, the process of analyzing the numerical-type data to obtain the second evaluation values is described. Similarly, the following analysis method for numerical-type data is only an example, and the model, algorithm, and processing sequence used in analyzing numerical-type data are not specifically limited in the present application.
As an alternative, when analyzing the numerical-type data, feature extraction may be performed first, and the extracted features then input into a pre-trained model to obtain the probability that the numerical-type data belongs to each candidate data category, that is, the second evaluation value corresponding to each candidate data category. Optionally, the model used for classifying the numerical-type data may be an ensemble learning model.
As an example, the characteristics of the numerical-type data may include length information, numerical value information, type information, general information, and the like. Optionally, the length information and the numerical value information may each be composed of statistics such as the maximum, minimum, mean, variance, standard deviation, mode, or median. The length information characterizes the length of the numerical-type data, which may also be understood as the number of bytes it occupies. The numerical value information is the specific numerical values included in the numerical-type data. The type information characterizes the format of the data included in the numerical-type data, for example, whether it is a character string. The general information represents the numerical-type data after data cleaning, that is, after the abnormal data have been removed.
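By way of illustration, the length and value statistics named above could be computed as follows; the feature names and the value-parsing rule are assumptions:

```python
import numpy as np
from scipy import stats

def numeric_column_features(cells: list) -> dict:
    """Length statistics and value statistics (max, min, mean, variance,
    standard deviation, mode, median) for one column of numerical-type data."""
    lengths = np.array([len(str(c)) for c in cells], dtype=float)
    values = np.array([float(c) for c in cells
                       if str(c).lstrip("-").replace(".", "", 1).isdigit()])

    def summarize(prefix, arr):
        if arr.size == 0:  # no parseable values in this column
            return {}
        return {
            f"{prefix}_max": float(arr.max()),
            f"{prefix}_min": float(arr.min()),
            f"{prefix}_mean": float(arr.mean()),
            f"{prefix}_var": float(arr.var()),
            f"{prefix}_std": float(arr.std()),
            f"{prefix}_mode": float(stats.mode(arr, keepdims=False).mode),
            f"{prefix}_median": float(np.median(arr)),
        }

    return {**summarize("len", lengths), **summarize("val", values)}

print(numeric_column_features(["330603", "330604", "330604"]))
```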
In some embodiments, after the features of the numerical-type data are extracted, they may be processed further. For example, for the numerical value information, if some values differ greatly from the mean, those values may be subjected to a feature-engineering conversion; for instance, values that differ significantly from the mean may be logarithmically converted, that is, processed with a logarithmic function. Optionally, before the logarithmic conversion, data with small values may first be excluded: for example, values between 0 and 1 may be left unconverted. For details, see fig. 3, a graph of a feature transfer function provided for an embodiment of the present application; its abscissa is the data before feature conversion, i.e., the features extracted from the numerical-type data, and its ordinate is the data after conversion.
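A sketch of such a conversion, under the assumption that "large" simply means exceeding a cut-off of 1:

```python
import numpy as np

def log_transform_large(x: np.ndarray, threshold: float = 1.0) -> np.ndarray:
    """Log-compress values that deviate far upward, while leaving small
    values (e.g. those between 0 and 1) unconverted."""
    out = x.astype(float).copy()
    mask = out > threshold            # the cut-off value is an assumption
    out[mask] = np.log(out[mask])
    return out

print(log_transform_large(np.array([0.5, 2.0, 330603.0])))
```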
In other embodiments, after the features of the numerical-type data are extracted, redundant features can be deleted to avoid feature redundancy. For example, if the extracted standard deviation and variance serve the same function, only one of them may be retained. Alternatively, the correlation between any two features may be determined first, and of two strongly correlated features, one may substitute for the other. For example, the correlation between standard deviation and variance is large, while the correlation between standard deviation and maximum is small; the standard deviation can therefore replace the variance (i.e., the variance feature is removed and only the standard deviation is retained), but it cannot replace the maximum (both features are retained) because their correlation is small. Further, the degree to which each feature contributes to the second evaluation value may then be calculated to decide which features to retain. For example, if the correlation between the standard deviation and the variance is large and the standard deviation contributes more to determining the second evaluation value, the variance feature may be deleted and the standard deviation retained.
As an example, when determining the correlation between features, the Pearson correlation coefficient (also called the Pearson product-moment correlation coefficient) may be used to analyze the correlation between any two features. When the Pearson correlation coefficient of two features is 0, the two features are uncorrelated; when it is 1, they are strongly positively correlated; and when it is -1, they are strongly negatively correlated. Further, after the correlations between features are determined, the degree to which each feature contributes to determining the second evaluation value may be analyzed, for example with a mutual-information analysis method that yields each feature's percentage contribution. For example, see fig. 4, a line graph characterizing the degree of association between the features and the second evaluation value according to an embodiment of the present application; its abscissa lists the features and its ordinate is the percentage contribution of each feature to determining the second evaluation value.
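A sketch of this redundancy-removal step; the 0.9 correlation cut-off is an assumption, and mutual information stands in for the contribution measure described above:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_classif

def drop_redundant(X: np.ndarray, y: np.ndarray, threshold: float = 0.9) -> list:
    """For each strongly correlated feature pair, keep the feature whose
    mutual information with the category label is higher, mirroring the
    standard-deviation-vs-variance example above. Returns kept column indices."""
    contribution = mutual_info_classif(X, y)
    keep = set(range(X.shape[1]))
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            if i in keep and j in keep:
                r, _ = pearsonr(X[:, i], X[:, j])
                if abs(r) >= threshold:  # e.g. std vs. variance
                    keep.discard(i if contribution[i] < contribution[j] else j)
    return sorted(keep)
```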
After the features of the numerical-type data are extracted and processed as above, they may be input into a pre-trained ensemble learning model to determine the second evaluation values corresponding to the candidate data categories. As an alternative, the present application proposes that a LightGBM model may be used to classify the numerical-type data. The LightGBM model adopts a gradient boosting algorithm improved from the Gradient Boosting Decision Tree (GBDT) algorithm and can be applied to classification, regression, ranking, and other scenarios. For ease of understanding, the GBDT algorithm is briefly described:
the GBDT algorithm is a representative algorithm in boosting series algorithms, is an iterative decision tree algorithm, and consists of a plurality of decision trees, and the conclusions of all the decision trees are accumulated to serve as a final result. Specifically, in the GBDT iteration process, a weak learner is found in the current iteration according to the strong learner and the loss function obtained in the previous iteration, so that the loss of the current iteration is minimized. That is, the decision tree found in the iteration of this round is to make the loss function of this round as smaller as possible.
For example, suppose there are N samples and m base learners are to be built, and let the number of leaf nodes, determined by the GBDT decision tree depth, be J. First, the initial base learner predicts a constant value c, chosen to minimize the total loss, as shown in formula (1):
$$ f_0(x) = \arg\min_{c} \sum_{i=1}^{N} L(y_i, c) \qquad (1) $$
where f_0(x) is the initial base learner, N is the number of training samples, y_i is the label of the i-th sample, c is a constant value, and L(y_i, c) is the loss function.
Given the loss of the first m−1 trees, the negative gradient is calculated as the sample residual for the m-th decision tree to fit, for example, as shown in formula (2):
$$ \gamma_{i,m} = -\left[ \frac{\partial L\big(y_i, f(x_i)\big)}{\partial f(x_i)} \right]_{f(x)=f_{m-1}(x)} \qquad (2) $$
where γ_{i,m} is the residual of the i-th sample for the m-th decision tree, x_i is the input data of the i-th sample, f_{m-1}(x_i) is the prediction of the first m−1 decision trees, y_i is the true label of the i-th sample, and L(y_i, f_{m-1}(x_i)) is the corresponding loss.
The best fit value for each leaf node is then calculated, see formula (3):
$$ \gamma_{j,m} = \arg\min_{\gamma} \sum_{x_i \in R_{j,m}} L\big(y_i,\, f_{m-1}(x_i) + \gamma\big) \qquad (3) $$
where γ_{j,m} is the fitted value of the j-th leaf node of the m-th decision tree, R_{j,m} is the sample region of that leaf, x_i is the input data of the i-th sample, f_{m-1}(x_i) is the prediction of the first m−1 decision trees, y_i is the true label of the i-th sample, and γ is the residual to be learned.
Further, according to the fitting results, the model is updated with the m-th decision tree, for example, as shown in formula (4):
$$ f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} \gamma_{j,m}\, I\big(x \in R_{j,m}\big) \qquad (4) $$
where f_m(x) is the fitted result of m decision trees, f_{m-1}(x) is the fitted result of the first m−1 decision trees, J is the number of leaf nodes, γ_{j,m} is the weight information of the j-th leaf, and I(x ∈ R_{j,m}) indicates the segmentation interval into which x falls.
Finally, the weighted results of the m decision trees are accumulated to obtain the final result, which can be shown in formula (5), for example:
$$ F(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} \gamma_{j,m}\, I\big(x \in R_{j,m}\big) \qquad (5) $$
where F(x) is the final objective function, f_0(x) is the initial base learner, M is the number of decision trees, J is the number of leaf nodes, γ_{j,m} is the model parameter to be learned, and I(x ∈ R_{j,m}) indicates the segmentation interval.
The GBDT algorithm has been introduced above. Optionally, before the LightGBM model is used for classification, its parameters may be set empirically or by grid search. For example, the learning rate (learning_rate) in the model may be set to 0.1, the number of leaves (num_leaves) to 38, and the minimum number of samples in a leaf node (min_data_in_leaf) to 170. The model parameters described above may be used to avoid over-fitting the model.
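For illustration, training such a classifier with the stated parameters might look as follows; the feature matrix, labels, boosting-round count, and nine-category assumption are all stand-ins:

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 14))     # stand-in numerical-type features
y_train = rng.integers(0, 9, size=1000)   # stand-in candidate-category labels

params = {
    "objective": "multiclass",
    "num_class": 9,            # assumes the nine candidate categories above
    "learning_rate": 0.1,      # parameter values taken from the description
    "num_leaves": 38,
    "min_data_in_leaf": 170,   # guards against over-fitting
}
model = lgb.train(params, lgb.Dataset(X_train, label=y_train),
                  num_boost_round=100)

second_eval_values = model.predict(X_train[:5])  # per-category probabilities
print(second_eval_values.shape)                  # (5, 9)
```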
In some embodiments, after the first evaluation values and the second evaluation values corresponding to the candidate data categories are determined, a set algorithm may be employed to determine the comprehensive evaluation value from each pair. As an example, a logistic regression algorithm may be used to combine the first evaluation value and the second evaluation value into the comprehensive evaluation value. For example, when the candidate data category is data category A, the probability that the text-type data belongs to data category A is 0.8, and the probability that the numerical-type data belongs to data category A is 0.9, then the first evaluation value corresponding to data category A is 0.8 and the second evaluation value is 0.9, and (0.8 × set coefficient 1 + 0.9 × set coefficient 2) may be taken as the probability that the data to be classified belongs to data category A, that is, the comprehensive evaluation value of data category A. The same method may be used to determine the probability that the data to be classified belongs to every candidate data category, that is, to calculate the comprehensive evaluation value corresponding to each candidate data category. Finally, the candidate data category corresponding to the maximum value among all the comprehensive evaluation values may be taken as the target data category of the data to be classified, that is, the classification result.
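One plausible realization of this fusion step, sketched with scikit-learn; learning the combination weights from labeled samples is an assumption, since the description only speaks of "set coefficients":

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
first_eval = rng.random((500, 9))       # stand-in outputs of the text-type model
second_eval = rng.random((500, 9))      # stand-in outputs of the numerical model
labels = rng.integers(0, 9, size=500)   # stand-in true categories for training

# Fit the combination weights instead of fixing them by hand.
fuser = LogisticRegression(max_iter=1000)
fuser.fit(np.hstack([first_eval, second_eval]), labels)

composite = fuser.predict_proba(np.hstack([first_eval, second_eval]))
target_category = composite.argmax(axis=1)  # category with the maximum value
print(target_category[:10])
```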
The process of classifying the data to be classified with the various models has been introduced above. In some scenarios, before the above algorithms and models are used for classification, training samples may first be obtained to train the models mentioned above. Optionally, the training samples may include text-type data and numerical-type data, together with manually added labels giving each sample's actual data category. The models may be trained based on the data categories they predict and these manual labels.
In some cases, the data category distribution of the acquired training samples is not uniform; for example, see fig. 5, a training sample distribution histogram provided in this embodiment of the present application. It can be seen that the numbers of training samples in the name, number, and code categories are large, while the numbers in categories such as amount are small. For this situation, the present application proposes that a negative sampling method can be used to equalize the training samples. Further, the training samples may be input into the models mentioned above to obtain the predicted data category of each training sample. Optionally, the models may also be evaluated by comparing the predicted data categories with the actual data categories of the training samples.
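A small sketch of one way the equalization step mentioned above could be done (the description does not fix the procedure, so this down-sampling strategy is an assumption):

```python
import pandas as pd

def balance_by_downsampling(samples: pd.DataFrame,
                            label_col: str = "label") -> pd.DataFrame:
    """Down-sample over-represented categories (e.g. name, number, code)
    to the size of the smallest category."""
    cap = samples[label_col].value_counts().min()
    return (samples.groupby(label_col, group_keys=False)
                   .apply(lambda g: g.sample(cap, random_state=0)))
```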
Alternatively, the model may be evaluated with common indexes such as the confusion matrix, misclassification rate, recall, precision, and the comprehensive evaluation index (F1-measure). The confusion matrix is a specific matrix used for presenting model performance; for example, fig. 6 is a schematic diagram of a confusion matrix provided in the embodiments of the present application. In fig. 6, each row represents the real data category of the training samples and each column represents the data category predicted by the model, so the values on the diagonal of the confusion matrix are the numbers of accurately predicted samples. Further, the misclassification rate, that is, the rate at which the data category of a training sample is predicted incorrectly, may be determined from the confusion matrix. Precision is defined with respect to the predictions: for a given category, a sample predicted as the positive class may be a true positive (the sample is of data category A and is predicted as data category A) or a false positive (the sample is not of data category A but is predicted as data category A), and precision is the proportion of true positives among the samples predicted as positive. Recall is defined with respect to the training samples: a positive sample may be predicted as the positive class or as the negative class, and recall is the proportion of positive samples that are predicted accurately. The F1-measure is the harmonic mean of precision and recall.
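These indexes are standard, so a sketch with scikit-learn suffices; the labels and predictions below are stand-ins:

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = ["name", "code", "date", "name", "amount", "code"]
y_pred = ["name", "code", "date", "code", "amount", "code"]

# Rows are real categories and columns are predicted ones; the diagonal
# counts the accurately predicted samples, as in fig. 6.
print(confusion_matrix(y_true, y_pred))

# Per-category precision, recall, F1 (their harmonic mean) and sample
# counts, similar to the evaluation report of fig. 7.
print(classification_report(y_true, y_pred, zero_division=0))
```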
Specifically, when evaluating a model, the predictions are combined with the manually added real labels of the samples to determine the above indexes, so that the model can be evaluated. For example, see fig. 7 for a model evaluation report provided by an embodiment of the present application, which includes the value of each index and the number of samples for each data category.
Based on the same concept as the method described above, referring to fig. 8, a data classification apparatus 800 is provided for the embodiment of the present application. The apparatus 800 may be configured to perform the steps of the above method, and therefore, in order to avoid repetition, the detailed description is omitted here. The apparatus 800 comprises: an acquisition unit 801 and a processing unit 802.
An obtaining unit 801, configured to obtain text-type data and numerical-type data included in data to be classified;
a processing unit 802 configured to perform:
analyzing the text-type data by adopting a first preset algorithm to obtain a first evaluation value corresponding to each candidate data category in at least one candidate data category; wherein the first evaluation value is determined based on a probability that the text-type data belongs to the corresponding candidate data category;
analyzing the numerical-type data by adopting a second preset algorithm to obtain a second evaluation value corresponding to each candidate data category; the second evaluation value is determined based on a probability that the numerical-type data belongs to the corresponding candidate data category;
determining a target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category; the target data category includes a candidate data category of the at least one candidate data category.
In some embodiments, the processing unit 802 is specifically configured to:
determining a comprehensive evaluation value corresponding to each candidate data category according to the first evaluation value and the second evaluation value corresponding to that candidate data category;
and determining the candidate data category corresponding to the maximum value in the determined comprehensive evaluation values as the target data category.
In some embodiments, the processing unit 802 is specifically configured to:
for any candidate data category in the at least one candidate data category, calculating the first evaluation value and the second evaluation value corresponding to that candidate data category by adopting a logistic regression algorithm to obtain the comprehensive evaluation value corresponding to that candidate data category.
In some embodiments, the data to be classified is data in a table format; the processing unit 802 is further configured to:
determining the text type data and the numerical type data in the data to be classified according to the name field included in the header of the data to be classified;
the acquisition unit 801 is instructed to acquire the text type data and the numerical type data.
In some embodiments, the processing unit 802 is specifically configured to:
converting the text-type data into at least one word vector;
and inputting the at least one word vector into a pre-trained convolutional neural network model to obtain a first evaluation value corresponding to each candidate data category in the at least one candidate data category.
In some embodiments, the processing unit 802 is further configured to:
and based on a pre-configured abnormal database, eliminating abnormal data included in the text type data and the numerical type data.
Fig. 9 shows a schematic structural diagram of an electronic device 900 provided in an embodiment of the present application. The electronic device 900 includes a controller 901 and a memory 902, and may further include a communication interface 903, such as a network port, through which the electronic device can transmit data; for example, the communication interface 903 may be used to obtain the data to be classified from an external device or from a database.
In this embodiment, the memory 902 stores instructions executable by the at least one controller 901, and the at least one controller 901 may be configured to execute the steps in the foregoing method by executing the instructions stored in the memory 902, for example, the controller 901 may implement the functions of the obtaining unit 801 and the processing unit 802 in fig. 8.
The controller 901 is the control center of the electronic device and may connect the various parts of the electronic device through various interfaces and lines, by running or executing the instructions stored in the memory 902 and calling up the data stored in the memory 902. Optionally, the controller 901 may include one or more processing units, and may integrate an application controller, which mainly handles the operating system, application programs, and the like, and a modem controller, which mainly handles wireless communication. It is to be understood that the modem controller may also not be integrated into the controller 901. In some embodiments, the controller 901 and the memory 902 may be implemented on the same chip, or, in some embodiments, on separate chips.
The controller 901 may be a general-purpose controller, such as a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose controller may be a microcontroller, any conventional controller, or the like. The steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware controller or by a combination of hardware and software modules in the controller.
The memory 902, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 902 may include at least one type of storage medium, for example, a flash memory, a hard disk, a multimedia card, a card-type memory, a random access memory (RAM), a static random access memory (SRAM), a programmable read-only memory (PROM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a magnetic memory, a magnetic disk, or an optical disc. The memory 902 may also be, without limitation, any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 902 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
By programming the controller 901, for example, the code corresponding to the data classification method described in the foregoing embodiments may be fixed into a chip, so that the chip can execute the steps of that method when running.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a controller of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the controller of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of data classification, comprising:
acquiring text type data and numerical type data included in data to be classified;
analyzing the text type data by adopting a first preset algorithm to obtain a first evaluation value corresponding to each candidate data category in at least one candidate data category; wherein the first evaluation value is determined based on a probability that the text type data belongs to a corresponding candidate data category;
analyzing the numerical data by adopting a second preset algorithm to obtain a second evaluation value corresponding to each candidate data category; the second evaluation value is determined based on a probability that the numerical data belongs to a corresponding candidate data category;
determining a target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category; the target data category includes a candidate data category of the at least one candidate data category.
2. The method of claim 1, wherein the determining the target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category comprises:
determining a comprehensive evaluation value corresponding to each candidate data category according to the first evaluation value and the second evaluation value corresponding to that candidate data category;
and determining the candidate data category corresponding to the maximum value in the determined comprehensive evaluation values as the target data category.
3. The method of claim 2, wherein determining the comprehensive evaluation value corresponding to each candidate data category based on the first evaluation value and the second evaluation value corresponding to the respective candidate data category comprises:
for any candidate data category in the at least one candidate data category, calculating the first evaluation value and the second evaluation value corresponding to that candidate data category by adopting a logistic regression algorithm to obtain the comprehensive evaluation value corresponding to that candidate data category.
4. The method according to any one of claims 1-3, wherein the data to be classified is data in a table format; the acquiring text type data and numerical type data included in the data to be classified includes:
determining the text type data and the numerical type data in the data to be classified according to the name field included in the header of the data to be classified;
and acquiring the text type data and the numerical type data.
5. The method of any one of claims 1-3, wherein the analyzing the text type data by adopting the first preset algorithm to obtain the first evaluation value corresponding to each candidate data category in the at least one candidate data category comprises:
converting the text type data into at least one word vector;
and inputting the at least one word vector into a pre-trained convolutional neural network model to obtain a first evaluation value corresponding to each candidate data category in the at least one candidate data category.
6. The method of any one of claims 1-3, further comprising:
and based on a pre-configured abnormal database, eliminating abnormal data included in the text type data and the numerical type data.
7. A data sorting apparatus, comprising:
the acquiring unit is used for acquiring text type data and numerical type data included in the data to be classified;
a processing unit configured to perform:
analyzing the text type data by adopting a first preset algorithm to obtain a first evaluation value corresponding to each candidate data category in at least one candidate data category; wherein the first evaluation value is determined based on a probability that the text type data belongs to a corresponding candidate data category;
analyzing the numerical data by adopting a second preset algorithm to obtain a second evaluation value corresponding to each candidate data category; the second evaluation value is determined based on a probability that the numerical data belongs to a corresponding candidate data category;
determining a target data category corresponding to the data to be classified according to the first evaluation value and the second evaluation value corresponding to each candidate data category; the target data category includes a candidate data category of the at least one candidate data category.
8. The apparatus as claimed in claim 7, wherein said processing unit is specifically configured to:
determining a comprehensive evaluation value corresponding to each candidate data category according to the first evaluation value and the second evaluation value corresponding to that candidate data category;
and determining the candidate data category corresponding to the maximum value in the determined comprehensive evaluation values as the target data category.
9. An electronic device, comprising a controller and a memory,
the memory for storing computer programs or instructions;
the controller for executing a computer program or instructions in a memory, such that the method of any of claims 1-6 is performed.
10. A computer-readable storage medium having stored thereon computer-executable instructions which, when invoked by a computer, cause the computer to perform the method of any one of claims 1 to 6.
CN202111575925.5A 2021-12-22 2021-12-22 Data classification method and device Pending CN114358153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111575925.5A CN114358153A (en) 2021-12-22 2021-12-22 Data classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111575925.5A CN114358153A (en) 2021-12-22 2021-12-22 Data classification method and device

Publications (1)

Publication Number Publication Date
CN114358153A 2022-04-15

Family

ID=81100788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111575925.5A Pending CN114358153A (en) 2021-12-22 2021-12-22 Data classification method and device

Country Status (1)

Country Link
CN (1) CN114358153A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination