CN105574544A - Data processing method and device

Data processing method and device

Info

Publication number
CN105574544A
Authority
CN
China
Prior art keywords
data
random forest
decision tree
forest model
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510943565.8A
Other languages
Chinese (zh)
Inventor
沈雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201510943565.8A
Publication of CN105574544A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features

Abstract

The invention provides a data processing method and device. The method comprises the following steps: training labelled data in a preset format to obtain a first random forest model, the first random forest model comprising T decision trees; performing label prediction, i.e. classification, on unlabelled data in the preset format according to the first random forest model; after the data acquire their true labels, extracting K sample sets from the data with true labels; building K decision trees; combining the first random forest model and the K decision trees into a second random forest model and using it to predict the data with true labels; calculating a comprehensive performance index for each decision tree in the second random forest model; and taking the T decision trees with the highest comprehensive performance index as the first random forest model. The method can improve the accuracy of data classification. The invention also provides a corresponding data processing device.

Description

Data processing method and device
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method and device.
Background technology
In the era of big data, data are widely used in many fields. Accurately distinguishing which of a large volume of data are normal and which are abnormal has become increasingly important. For example, in the field of social security medical insurance, illegal arbitrage of medical insurance benefits keeps intensifying, while traditional arbitrage detection is based on simple rules or offline models. As the volume of data grows, the processing capacity of such naive or offline models declines, and many offenders try to work around the rules and models, so prevention and control in the social security medical insurance system lag behind and it becomes difficult to judge accurately whether a medical insurance transaction is a normal transaction or an abnormal transaction such as an arbitrage transaction. If the system cannot make this judgement accurately, arbitrage cases will keep occurring and affect the stable operation of the medical insurance fund. How to distinguish the class of data accurately, for example whether data are normal or abnormal, is therefore a problem that still needs to be solved.
Summary of the invention
The invention provides a data processing method that improves the accuracy of data classification.
In addition, the present invention also provides a device that uses the above data processing method.
A data processing method comprises:
obtaining unlabelled data in a preset format;
judging whether a first random forest model has been established, the first random forest model comprising T decision trees;
if the first random forest model has been established, performing label prediction on the unlabelled data in the preset format according to the first random forest model, and saving the label prediction results;
obtaining data with true labels according to the label prediction results;
extracting K sample sets, with replacement, from the data with true labels, where K < T;
building K decision trees from the K sample sets;
combining the first random forest model and the K decision trees into a second random forest model, and performing label prediction on the data with true labels using the second random forest model;
calculating, from the label prediction results, information about each decision tree in the second random forest model, including a comprehensive performance index of each decision tree; and
deleting the K decision trees with the lowest comprehensive performance index, and taking the T decision trees that are not deleted as the first random forest model.
A data processing device comprises: an acquisition module, a judging module, a prediction module, a sample set extraction module, a decision tree generation module, a calculation module and a deletion module;
the acquisition module is configured to obtain unlabelled data in a preset format;
the judging module is configured to judge whether a first random forest model has been established, the first random forest model comprising T decision trees;
the prediction module is configured to, if the first random forest model has been established, perform label prediction on the unlabelled data in the preset format according to the first random forest model, and save the label prediction results;
the acquisition module is further configured to obtain data with true labels according to the label prediction results;
the sample set extraction module is configured to extract K sample sets, with replacement, from the data with true labels, where K < T;
the decision tree generation module is configured to build K decision trees from the K sample sets;
the prediction module is further configured to combine the first random forest model and the K decision trees into a second random forest model, and to perform label prediction on the data with true labels using the second random forest model;
the calculation module is configured to calculate, from the label prediction results, information about each decision tree in the second random forest model, including a comprehensive performance index of each decision tree; and
the deletion module is configured to delete the K decision trees with the lowest comprehensive performance index, and to take the T decision trees that are not deleted as the first random forest model.
In the above data processing method and device, a first random forest model comprising T decision trees is first used to perform label prediction, i.e. classification, on a large amount of data. The true labels of the data are then determined from the prediction results, K sample sets are drawn from the data with true labels, and K decision trees are built from them. A second random forest model composed of the T decision trees and the K decision trees then performs label prediction on the data with true labels, the comprehensive performance index of each decision tree in the second random forest model is calculated from the prediction results, and the T decision trees with the highest comprehensive performance index are chosen as the first random forest model. In other words, after the first random forest model classifies unlabelled data by label prediction, the classified data are used to update the first random forest model. Because the first random forest model keeps being updated, classifying data with the updated first random forest model improves the accuracy of classification.
Accompanying drawing explanation
In order to explain the solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the data processing method.
Fig. 2 is a flowchart of the data preprocessing method.
Fig. 3 is a functional block diagram of the data processing device.
Fig. 4 is a functional block diagram of the acquisition module.
Detailed description of the embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the drawings.
Fig. 1 is a flowchart of the data processing method, which comprises the following steps.
Step S101: obtain unlabelled data in a preset format. Data in a preset format are data that share the same attributes. There are many such records, some of which record normal transactions and some abnormal transactions. In this embodiment, a label indicates whether a record is a normal transaction or an abnormal transaction, so the data have two kinds of labels: normal transaction and abnormal transaction. Records of normal transactions are white samples and records of abnormal transactions are black samples. In other implementations the data may have more than two labels. This embodiment is described using social security medical insurance transaction data as an example. Such data contain records of ordinary people using social security medical insurance to see a doctor and buy medicine, and may also contain records of offenders using social security medical insurance to see a doctor and buy medicine; the corresponding labels are normal transaction and abnormal transaction. Unlabelled data are social security medical insurance transaction records that have not yet been distinguished as white samples (normal transactions) or black samples (abnormal transactions).
Step S102: judge whether a first random forest model has been established, the first random forest model comprising T decision trees. If the first random forest model has been established, go to step S105; if it has not, go to step S103. Here, the T decision trees together form a random forest, which is referred to as the random forest model.
Step S103: obtain labelled data in the preset format. Data in the preset format are data that share the same attributes, and there are many such records. If no random forest model has been established yet, labelled data in the preset format are obtained. For example, in social security medical insurance, labelled transaction data are transaction records that have already been distinguished as white samples (normal transactions) or black samples (abnormal transactions).
Step S104: import the labelled data in the preset format into model training to generate the first random forest model, the first random forest model comprising T decision trees. A random forest model can be used for classification. In machine learning, a random forest classification model is a classifier comprising multiple decision trees, and its output class is determined from the classes output by the individual decision trees. Specifically, the basic idea of random forest classification is: draw K sample sets at random, with replacement, from the original sample set, each sample set having the same size as the original sample set; build a decision tree model for each of the K sample sets, with each decision tree casting one vote for a classification result, which gives K classification results; and determine the final class of each sample by voting over the K classification results. (Here K denotes the number of trees in a generic random forest; for the first random forest model of step S104, K equals T.) Generating a random forest is the process of training each decision tree, which comprises the following steps: (1) randomly select k samples with replacement and train one decision tree on these k samples; (2) each sample has M attributes; when a node of the decision tree needs to be split, randomly select m attributes from the M attributes (in general m << M), and then choose the best of these m attributes, according to some strategy, as the splitting attribute of the current node; (3) split every node of the decision tree according to step (2) until no further split is possible. The random forest can be expressed as {h(X, Θ_k), k = 1, ..., K}, where h(X, Θ_k) denotes a decision tree and K is the number of decision trees in the forest. Θ_k is a sequence of independent and identically distributed random variables whose randomness comes from two sources: (1) K training sample sets of the same size as the original sample set X are drawn from it with replacement, and each training sample set is used to construct a corresponding decision tree; (2) when splitting each node of a decision tree, a random subset is drawn from all attributes with equal probability, and an optimal attribute is then selected from this subset to split the node. The randomness introduced into the forest greatly reduces the correlation between the decision trees and improves classification performance.
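As an illustration of the training procedure described in step S104, the sketch below builds decision trees on bootstrap samples drawn with replacement, restricts each split to a random subset of attributes, and classifies by majority vote. This is only a minimal sketch under the assumption that NumPy and scikit-learn are available; the function names, the use of DecisionTreeClassifier with max_features, and the majority-vote helper are illustrative choices, not the patented implementation.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def train_random_forest(X, y, n_trees, m_features, rng=None):
        """Train n_trees decision trees on bootstrap samples of (X, y).

        X and y are NumPy arrays. Each tree is trained on a sample drawn with
        replacement that has the same size as the original set, and considers
        only m_features randomly chosen attributes at each split.
        """
        rng = np.random.default_rng(rng)
        n = len(X)
        forest = []
        for _ in range(n_trees):
            idx = rng.integers(0, n, size=n)              # bootstrap sample, drawn with replacement
            tree = DecisionTreeClassifier(max_features=m_features)
            tree.fit(X[idx], y[idx])
            forest.append(tree)
        return forest

    def predict_majority(forest, X):
        """Each tree casts one vote; the majority label is the forest's prediction."""
        votes = np.stack([tree.predict(X) for tree in forest])   # shape (n_trees, n_samples)
        predictions = []
        for column in votes.T:
            labels, counts = np.unique(column, return_counts=True)
            predictions.append(labels[np.argmax(counts)])
        return np.array(predictions)

With labelled training data X and y, a call such as forest = train_random_forest(X, y, n_trees=T, m_features=m) would play the role of generating the first random forest model of step S104.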
Steps S103 and S104 generate the first random forest model. The generated first random forest model can perform label prediction on unlabelled data, predicting whether a given record is a normal transaction or an abnormal transaction. Performing label prediction on unlabelled data means classifying data that have not yet been classified.
Step S105: perform label prediction on the unlabelled data in the preset format according to the established first random forest model, and save the prediction results. Label prediction is classification prediction: the unclassified data are classified and the classification results are saved.
Step S106: obtain data with true labels according to the prediction results. The results of label prediction by the random forest model are not necessarily all correct and may be partly wrong; the labels of wrongly predicted records can be corrected manually, or by other methods, according to the actual circumstances of the transactions. If the prediction results are all correct, the predicted labels are the true labels; if no incorrect predictions are detected, the prediction results are likewise treated as data with true labels.
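A minimal sketch of how step S106 might be realised, assuming the predicted labels are held in a NumPy array and that manual review yields a mapping from record index to the verified label; the data layout and the function name are assumptions made only for illustration.

    import numpy as np

    def apply_corrections(predicted_labels, corrections):
        """Predicted labels become the true labels, except where a correction is supplied."""
        y_true = np.asarray(predicted_labels).copy()
        for index, label in corrections.items():
            y_true[index] = label        # fix the records found to be mispredicted
        return y_true

    # Example: records 3 and 17 were found to be abnormal transactions after review.
    # y_true = apply_corrections(predicted_labels, {3: 1, 17: 1})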
Step S107: extract K sample sets, with replacement, from the data with true labels, where K < T. After the true labels of the data have been determined, K sample sets are drawn with replacement from the data with true labels; preferably, K is between T/20 and T/10.
Step S108: generate K decision trees from the K sample sets. Each decision tree is constructed as described in step S104 and is not described again here. Every tree is trained, giving K decision trees.
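Steps S107 and S108 can reuse the training sketch given after step S104: K sample sets are drawn with replacement from the true-label data and one new tree is trained per set. The snippet below assumes the train_random_forest helper defined earlier and uses placeholder data; the value of K is merely an example in the preferred range between T/20 and T/10.

    import numpy as np

    # Placeholder true-label data standing in for the corrected output of step S106.
    X_true = np.random.rand(1000, 8)             # 1000 records, 8 attributes
    y_true = np.random.randint(0, 2, 1000)       # 0 = white sample, 1 = black sample

    T = 200                                      # trees in the first random forest model (example)
    K = max(1, T // 15)                          # roughly between T/20 and T/10

    # Steps S107-S108: K bootstrap sample sets, one new decision tree per set.
    new_trees = train_random_forest(X_true, y_true, n_trees=K, m_features=4)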
Step S109: combine the first random forest model and the K decision trees into a second random forest model, perform label prediction on the data with true labels using the second random forest model, and save the prediction results.
Step S110: from the label prediction results, calculate information about each decision tree in the second random forest model, including the comprehensive performance index of each decision tree. The performance indexes of a decision tree include, but are not limited to, accuracy and coverage. Accuracy is the proportion of the total sample set for which the predicted label is correct; coverage is the proportion of all black samples that are predicted as black, or the proportion of all white samples that are predicted as white; in this embodiment it is the proportion of all black samples that are predicted as black. Preferably, the comprehensive performance index = a × accuracy + b × coverage, where a + b = 1 and a = b = 1/2. In some embodiments the comprehensive performance index includes only accuracy, or only coverage. In other embodiments the generation time of a decision tree is also used as a weight that influences the comprehensive performance index, for example so that a decision tree generated close to the current time has a larger weight than a decision tree generated long before the current time.
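The sketch below computes this comprehensive performance index for a single decision tree, assuming binary labels in which 1 marks a black sample (abnormal transaction) and using the preferred weights a = b = 1/2; the label encoding and the function name are illustrative assumptions.

    import numpy as np

    def tree_performance(tree, X_true, y_true, a=0.5, b=0.5, black_label=1):
        """Comprehensive performance index = a * accuracy + b * coverage."""
        predicted = tree.predict(X_true)
        accuracy = np.mean(predicted == y_true)               # correct predictions / all samples
        black = (y_true == black_label)
        coverage = np.mean(predicted[black] == black_label) if black.any() else 0.0
        return a * accuracy + b * coverage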
Step S111: sort the decision trees in the second random forest in descending order of their comprehensive performance index to obtain a ranking.
Step S112: delete the last K decision trees in the ranking, and take the T decision trees that are not deleted as the first random forest model. This first random forest model is the updated first random forest model. Note that the K decision trees mentioned here are not the K decision trees of steps S108 and S109: they are the K decision trees with the lowest comprehensive performance index after sorting the second random forest model, not the K decision trees built from the K sample sets in steps S108 and S109. Likewise, the T decision trees mentioned here are not the T decision trees of steps S102 and S104, but the T decision trees with the highest comprehensive performance index after sorting the second random forest model, i.e. the updated T decision trees.
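Continuing the earlier sketches, steps S111 and S112 amount to scoring all T + K trees of the second random forest model, sorting the scores in descending order and keeping the T best trees. The helper below assumes the tree_performance function defined above and is illustrative rather than the claimed implementation.

    def update_forest(second_forest, X_true, y_true, T):
        """Keep the T trees with the highest comprehensive performance index."""
        ranked = sorted(second_forest,
                        key=lambda tree: tree_performance(tree, X_true, y_true),
                        reverse=True)            # highest comprehensive index first
        return ranked[:T]                        # the K lowest-scoring trees are dropped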
Steps S107 to S112 form the model adaptation flow, i.e. the flow that updates the first random forest model. Updating the first random forest model allows it to adapt to new changes in the data, so that the label prediction results, i.e. the classification results, become more accurate. Whenever new data need to be processed, the model is updated in the same way.
In other implementations, the acquired data (whether unlabelled or labelled) may not be in the preset format; after acquisition, such data need to be preprocessed and converted into data in the preset format. Fig. 2 is a flowchart of this preprocessing method, which comprises the following steps.
Step S201: obtain data. There are many records, including records of normal transactions and records of abnormal transactions. In this embodiment the data have two kinds of labels, normal transaction and abnormal transaction; records of normal transactions are white samples and records of abnormal transactions are black samples. For example, in social security medical insurance, the data contain records of ordinary people using social security medical insurance to see a doctor and buy medicine, and may also contain records of offenders doing so. Unlabelled means that a transaction record has not been distinguished as a normal or an abnormal transaction; labelled means that it has been. Each acquired transaction record corresponds to one visit to see a doctor and buy medicine, and its attributes include, but are not limited to, name, age, sex, time of treatment, amount of social security medical insurance used in the single visit, cause of disease, medicine used, region and label. For unlabelled data, the value of the label attribute is empty.
Step S202: collect the attributes of the data and handle abnormal data. Abnormal data include records that do not satisfy the rules, records whose important attribute values are empty, records in inconsistent formats, and so on. Methods for handling abnormal data include, but are not limited to, minimum/maximum rule processing, missing-value processing and format conversion. For example, in social security medical insurance, a minimum/maximum rule might state that if the "age" attribute of a transaction record is 200 the record does not satisfy the rule and is filtered out; missing-value processing might filter out a record whose "name" attribute is 0; and format conversion might unify the unit of the "amount of social security medical insurance used" attribute, converting records expressed in fen to the unit used by the other records, jiao, so that the transaction data are convenient to process.
Step S203: extract an attribute set useful for data processing from the attributes of the data, and form data in the preset format from the attribute values corresponding to this attribute set. The preset format means that the data share the same attributes. In some embodiments the extraction methods include, but are not limited to, statistically merging the values of the same attribute across multiple records into one record, for example counting, in social security medical insurance, the number of visits of the same person within a certain period and the total amount of social security medical insurance used. The attribute set includes, but is not limited to, name, age, sex, number of visits, total amount of social security medical insurance used, cause of disease, medicine used, region and label. For unlabelled data, the value of the label attribute is empty.
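A minimal sketch of the preprocessing flow of Fig. 2 (steps S201 to S203), assuming the raw transaction records are loaded into a pandas DataFrame with attributes like those mentioned in the embodiment; the column names, thresholds and the "claims.csv" path are all illustrative assumptions.

    import pandas as pd

    df = pd.read_csv("claims.csv")                       # step S201: obtain data

    # Step S202: handle abnormal data.
    df = df[(df["age"] >= 0) & (df["age"] <= 120)]       # min/max rule: drop impossible ages
    df = df[df["name"].notna() & (df["name"] != 0)]      # missing-value rule: drop empty names
    in_fen = df["unit"] == "fen"
    df.loc[in_fen, "amount"] = df.loc[in_fen, "amount"] / 10.0   # unify amounts to jiao

    # Step S203: aggregate each person's records over a period into one preset-format record.
    preset = (df.groupby("person_id")
                .agg(visits=("claim_id", "count"),
                     total_amount=("amount", "sum"),
                     label=("label", "first"))
                .reset_index())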
Fig. 3 is a functional block diagram of the data processing device. The device comprises an acquisition module 11, a judging module 12, a prediction module 13, a sample set extraction module 14, a decision tree generation module 15, a calculation module 16, a sorting module 17 and a deletion module 18.
The acquisition module 11 is configured to obtain unlabelled data in a preset format. Data in a preset format are data that share the same attributes. There are many such records, some recording normal transactions and some abnormal transactions. In this embodiment a label indicates whether a record is a normal transaction or an abnormal transaction, so the data have two kinds of labels: normal transaction and abnormal transaction. Records of normal transactions are white samples and records of abnormal transactions are black samples; in other implementations the data may have more than two labels. This embodiment is described using social security medical insurance transaction data as an example: such data contain records of ordinary people using social security medical insurance to see a doctor and buy medicine, and may also contain records of offenders doing so, the labels being normal transaction and abnormal transaction respectively. Unlabelled data are transaction records that have not yet been distinguished as white samples (normal transactions) or black samples (abnormal transactions).
The judging module 12 is configured to judge whether a first random forest model has been established, the first random forest model comprising T decision trees. The T decision trees together form a random forest, which is referred to as the random forest model.
The acquisition module 11 is also configured to obtain labelled data in the preset format. Data in the preset format are data that share the same attributes, and there are many such records. If no random forest model has been established yet, labelled data in the preset format are obtained; for example, in social security medical insurance, labelled transaction data are records that have already been distinguished as white samples (normal transactions) or black samples (abnormal transactions).
The decision tree generation module 15 is configured to import the labelled data in the preset format into model training to generate the first random forest model, the first random forest model comprising T decision trees. A random forest model can be used for classification: in machine learning, a random forest classification model is a classifier comprising multiple decision trees, and its output class is determined from the classes output by the individual decision trees.
The prediction module 13 is configured to, if the first random forest model has been established, perform label prediction on the unlabelled data in the preset format according to the established first random forest model, and save the prediction results. Label prediction is classification prediction: the unclassified data are classified and the classification results are saved.
The acquisition module 11 is also configured to obtain data with true labels according to the prediction results. The results of label prediction by the first random forest model are not necessarily all correct and may be partly wrong; the labels of wrongly predicted records can be corrected manually, or by other methods, according to the actual circumstances of the transactions. If the prediction results are all correct, the predicted labels are the true labels; if no incorrect predictions are detected, the prediction results are likewise treated as data with true labels.
The sample set extraction module 14 is configured to extract K sample sets, with replacement, from the data with true labels, where K < T. After the true labels of the data have been determined, K sample sets are drawn with replacement from the data with true labels; preferably, K is between T/20 and T/10.
The decision tree generation module 15 is also configured to build K decision trees from the K sample sets.
The prediction module 13 is also configured to combine the first random forest model and the K decision trees into a second random forest model, to perform label prediction on the data with true labels using the second random forest model, and to save the prediction results.
The calculation module 16 is configured to calculate, from the label prediction results, information about each decision tree in the second random forest model, including the comprehensive performance index of each decision tree. The performance indexes of a decision tree include, but are not limited to, accuracy and coverage. Accuracy is the proportion of the total sample set for which the predicted label is correct; coverage is the proportion of all black samples that are predicted as black, or the proportion of all white samples that are predicted as white; in this embodiment it is the proportion of all black samples that are predicted as black. Preferably, the comprehensive performance index = a × accuracy + b × coverage, where a + b = 1 and a = b = 1/2. In some embodiments the comprehensive performance index includes only accuracy, or only coverage. In other embodiments the generation time of a decision tree is also used as a weight that influences the comprehensive performance index, for example so that a decision tree generated close to the current time has a larger weight than a decision tree generated long before the current time.
The sorting module 17 is configured to sort the decision trees in the second random forest in descending order of their comprehensive performance index to obtain a ranking.
The deletion module 18 is configured to delete the last K decision trees in the ranking and to take the T decision trees that are not deleted as the first random forest model. This first random forest model is the updated first random forest model. Note that the K decision trees mentioned here are not the K decision trees built by the decision tree generation module and used by the prediction module: they are the K decision trees with the lowest comprehensive performance index after sorting the second random forest model, not the K decision trees obtained from the K sample sets. Likewise, the T decision trees mentioned here are not the original T decision trees, but the T decision trees with the highest comprehensive performance index after sorting the second random forest model, i.e. the updated T decision trees.
In other embodiments the acquired data may not be in the preset format, and the acquisition module 11 also has the function of converting such data into data in the preset format. Specifically, as shown in Fig. 4, the acquisition module 11 comprises an acquiring unit 111 and a preprocessing unit 112.
The acquiring unit 111 is configured to obtain data. There are many records, including records of normal transactions and records of abnormal transactions. In this embodiment the data have two kinds of labels, normal transaction and abnormal transaction; records of normal transactions are white samples and records of abnormal transactions are black samples. For example, in social security medical insurance, the data contain records of ordinary people using social security medical insurance to see a doctor and buy medicine, and may also contain records of offenders doing so. Unlabelled means that a transaction record has not been distinguished as a normal or an abnormal transaction; labelled means that it has been. Each acquired transaction record corresponds to one visit to see a doctor and buy medicine, and its attributes include, but are not limited to, name, age, sex, time of treatment, amount of social security medical insurance used in the single visit, cause of disease, medicine used, region and label. For unlabelled data, the value of the label attribute is empty.
The preprocessing unit 112 is configured to collect the attributes of the data and handle abnormal data. Abnormal data include records that do not satisfy the rules, records whose important attribute values are empty, records in inconsistent formats, and so on. Methods for handling abnormal data include, but are not limited to, minimum/maximum rule processing, missing-value processing and format conversion. For example, in social security medical insurance, a minimum/maximum rule might filter out a record whose "age" attribute is 200; missing-value processing might filter out a record whose "name" attribute is 0; and format conversion might convert amounts expressed in fen into jiao so that all records use the same unit and are convenient to process.
The preprocessing unit 112 is also configured to extract an attribute set useful for data processing from the attributes of the data, and to form data in the preset format from the attribute values corresponding to this attribute set. The preset format means that the data share the same attributes. In some embodiments the extraction methods include, but are not limited to, statistically merging the values of the same attribute across multiple records into one record, for example counting the number of visits of the same person within a certain period and the total amount of social security medical insurance used. The attribute set includes, but is not limited to, name, age, sex, number of visits, total amount of social security medical insurance used, cause of disease, medicine used, region and label. For unlabelled data, the value of the label attribute is empty.
In the above data processing method and device, the first random forest model performs label prediction on the data to determine whether they are normal or abnormal; after the true labels of the data have been determined, the first random forest model, which comprises T decision trees, is updated using the data with true labels. Updating the model means extracting K sample sets from the data with true labels, building K decision trees, combining the first random forest model and the K decision trees into a second random forest model, performing label prediction on the data with true labels, calculating the comprehensive performance index of each decision tree in the second random forest model from the prediction results, and choosing the T decision trees with the highest comprehensive performance index as the first random forest model. Because the first random forest model is updated from the data it has just classified, it is always being updated, which improves the accuracy of data classification. The method and device can be used in the field of social security medical insurance to improve the prevention and control of medical insurance arbitrage. They do not require constant manual intervention, which saves cost over the service life of the model after deployment.
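Putting the earlier sketches together, the update loop summarised above could look roughly like the following; every helper comes from the previous sketches and the whole snippet is an illustrative assumption rather than the claimed implementation.

    def process_batch(forest, X_new, X_true, y_true, T, K, m_features):
        """One pass of the adaptive flow: classify new data, then refresh the forest."""
        predictions = predict_majority(forest, X_new)                   # step S105: classify unlabelled data
        new_trees = train_random_forest(X_true, y_true, K, m_features)  # steps S107-S108: K new trees
        second_forest = forest + new_trees                              # step S109: second random forest model
        updated = update_forest(second_forest, X_true, y_true, T)       # steps S110-S112: keep the best T
        return predictions, updated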
The foregoing are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modifications, equivalent replacements and improvements made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A data processing method, characterized in that the method comprises the following steps:
obtaining unlabelled data in a preset format;
judging whether a first random forest model has been established, the first random forest model comprising T decision trees;
if the first random forest model has been established, performing label prediction on the unlabelled data in the preset format according to the first random forest model, and saving the label prediction results;
obtaining data with true labels according to the label prediction results;
extracting K sample sets, with replacement, from the data with true labels, where K < T;
building K decision trees from the K sample sets;
combining the first random forest model and the K decision trees into a second random forest model, and performing label prediction on the data with true labels using the second random forest model;
calculating, from the label prediction results, information about each decision tree in the second random forest model, wherein the information about each decision tree comprises a comprehensive performance index of the decision tree; and
deleting the K decision trees with the lowest comprehensive performance index, and taking the T decision trees that are not deleted as the first random forest model.
2. the method for claim 1, is characterized in that: the integrated performance index of each decision tree comprises accuracy rate and/or coverage rate.
3. the method for claim 1, it is characterized in that: calculate each decision tree information in described second Random Forest model respectively according to the result of described Tag Estimation, wherein, after described each decision tree information comprises the step of the integrated performance index of each decision tree, described method also comprises:
According to described integrated performance index, sequence is carried out to each decision tree in described second Random Forest model and obtain ranking results;
Delete K the decision tree that described in described ranking results, integrated performance index is minimum, using a not deleted T decision tree as the first Random Forest model.
4. the method for claim 1, it is characterized in that: judge whether to set up the first Random Forest model, after described first Random Forest model comprises the step of T decision tree, described method also comprises: if do not set up the first Random Forest model, obtains the data having label of preset format; Generate the first Random Forest model to the data of the label execution random forests algorithm that has of described preset format, described first Random Forest model comprises T decision tree.
5. The method of claim 4, characterized in that the step of obtaining unlabelled data in the preset format, or of obtaining labelled data in the preset format, comprises:
obtaining data;
collecting the attributes of the data and handling abnormal data; and
extracting an attribute set useful for data processing from the attributes of the data, and forming the data in the preset format from the attribute values corresponding to the attribute set.
6. A data processing device, characterized in that the device comprises an acquisition module, a judging module, a prediction module, a sample set extraction module, a decision tree generation module, a calculation module and a deletion module;
the acquisition module is configured to obtain unlabelled data in a preset format;
the judging module is configured to judge whether a first random forest model has been established, the first random forest model comprising T decision trees;
the prediction module is configured to, if the first random forest model has been established, perform label prediction on the unlabelled data in the preset format according to the first random forest model, and save the label prediction results;
the acquisition module is further configured to obtain data with true labels according to the label prediction results;
the sample set extraction module is configured to extract K sample sets, with replacement, from the data with true labels, where K < T;
the decision tree generation module is configured to build K decision trees from the K sample sets;
the prediction module is further configured to combine the first random forest model and the K decision trees into a second random forest model, and to perform label prediction on the data with true labels using the second random forest model;
the calculation module is configured to calculate, from the label prediction results, information about each decision tree in the second random forest model, including a comprehensive performance index of each decision tree; and
the deletion module is configured to delete the K decision trees with the lowest comprehensive performance index, and to take the T decision trees that are not deleted as the first random forest model.
7. The device of claim 6, characterized in that the comprehensive performance index of a decision tree comprises accuracy and/or coverage.
8. The device of claim 6, characterized in that the device further comprises a sorting module, the sorting module being configured to sort the decision trees in the second random forest model according to the comprehensive performance index to obtain a ranking; and the deletion module is further configured to delete the K decision trees with the lowest comprehensive performance index in the ranking, and to take the T decision trees that are not deleted as the first random forest model.
9. The device of claim 6, characterized in that:
the acquisition module is further configured to, if the first random forest model has not been established, obtain labelled data in the preset format; and
the decision tree generation module is further configured to run a random forest algorithm on the labelled data in the preset format to generate the first random forest model, the first random forest model comprising T decision trees.
10. The device of claim 9, characterized in that the acquisition module comprises an acquiring unit and a preprocessing unit;
the acquiring unit is configured to obtain data; and
the preprocessing unit is configured to collect the attributes of the data and to handle abnormal data, and is further configured to extract an attribute set useful for data processing from the attributes of the data and to form the data in the preset format from the attribute values corresponding to the attribute set.
CN201510943565.8A 2015-12-16 2015-12-16 Data processing method and device Pending CN105574544A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510943565.8A CN105574544A (en) 2015-12-16 2015-12-16 Data processing method and device


Publications (1)

Publication Number Publication Date
CN105574544A true CN105574544A (en) 2016-05-11

Family

ID=55884650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510943565.8A Pending CN105574544A (en) 2015-12-16 2015-12-16 Data processing method and device

Country Status (1)

Country Link
CN (1) CN105574544A (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101251851A (en) * 2008-02-29 2008-08-27 吉林大学 Multi-classifier integrating method based on increment native Bayes network
CN104636814A (en) * 2013-11-14 2015-05-20 中国科学院深圳先进技术研究院 Method and system for optimizing random forest models
CN104391970A (en) * 2014-12-04 2015-03-04 深圳先进技术研究院 Attribute subspace weighted random forest data processing method

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156315A (en) * 2016-07-01 2016-11-23 中国人民解放军装备学院 A kind of data quality monitoring method judged based on disaggregated model
CN106156315B (en) * 2016-07-01 2019-05-17 中国人民解放军装备学院 A kind of data quality monitoring method based on disaggregated model judgement
CN109446017A (en) * 2018-09-03 2019-03-08 平安科技(深圳)有限公司 A kind of alarm algorithm generation method, monitoring system and terminal device
CN109544150A (en) * 2018-10-09 2019-03-29 阿里巴巴集团控股有限公司 A kind of method of generating classification model and device calculate equipment and storage medium
CN109615538A (en) * 2018-12-13 2019-04-12 平安医疗健康管理股份有限公司 Social security violation detection method, device, equipment and computer storage medium
CN109919197B (en) * 2019-02-13 2023-07-21 创新先进技术有限公司 Random forest model training method and device
CN113038302A (en) * 2019-12-25 2021-06-25 中国电信股份有限公司 Flow prediction method and device and computer storage medium
CN113038302B (en) * 2019-12-25 2022-09-30 中国电信股份有限公司 Flow prediction method and device and computer storage medium
CN111242793B (en) * 2020-01-16 2024-02-06 上海金仕达卫宁软件科技有限公司 Medical insurance data abnormality detection method and device
CN111242793A (en) * 2020-01-16 2020-06-05 上海金仕达卫宁软件科技有限公司 Method and device for detecting medical insurance data abnormity
CN111291896A (en) * 2020-02-03 2020-06-16 深圳前海微众银行股份有限公司 Interactive random forest subtree screening method, device, equipment and readable medium
CN111291896B (en) * 2020-02-03 2022-02-01 深圳前海微众银行股份有限公司 Interactive random forest subtree screening method, device, equipment and readable medium
CN111967229A (en) * 2020-09-01 2020-11-20 申建常 Efficient label type data analysis method and analysis system
CN112651439A (en) * 2020-12-25 2021-04-13 平安科技(深圳)有限公司 Material classification method and device, computer equipment and storage medium
CN112651439B (en) * 2020-12-25 2023-12-22 平安科技(深圳)有限公司 Material classification method, device, computer equipment and storage medium
CN113762394A (en) * 2021-09-09 2021-12-07 昆明理工大学 Blasting block size prediction method
CN113762394B (en) * 2021-09-09 2024-04-26 昆明理工大学 Blasting block prediction method
CN114913372B (en) * 2022-05-10 2023-05-26 电子科技大学 Target recognition method based on multi-mode data integration decision
CN114913372A (en) * 2022-05-10 2022-08-16 电子科技大学 Target recognition algorithm based on multi-mode data integration decision

Similar Documents

Publication Publication Date Title
CN105574544A (en) Data processing method and device
CN110223168B (en) Label propagation anti-fraud detection method and system based on enterprise relationship map
CN103793484B (en) The fraud identifying system based on machine learning in classification information website
CN108629413A (en) Neural network model training, trading activity Risk Identification Method and device
CN104050556B (en) The feature selection approach and its detection method of a kind of spam
CN103092975A (en) Detection and filter method of network community garbage information based on topic consensus coverage rate
CN108021651A (en) Network public opinion risk assessment method and device
CN109818961A (en) A kind of network inbreak detection method, device and equipment
CN110909195A (en) Picture labeling method and device based on block chain, storage medium and server
CN107368856A (en) Clustering method and device, the computer installation and readable storage medium storing program for executing of Malware
CN111538741B (en) Deep learning analysis method and system for big data of alarm condition
CN110310114A (en) Object classification method, device, server and storage medium
CN115577701B (en) Risk behavior identification method, device, equipment and medium aiming at big data security
CN107003992A (en) Perception associative memory for neural language performance identifying system
CN112329816A (en) Data classification method and device, electronic equipment and readable storage medium
CN107679135A (en) The topic detection of network-oriented text big data and tracking, device
CN104504334B (en) System and method for assessing classifying rules selectivity
CN109191210A (en) A kind of broadband target user's recognition methods based on Adaboost algorithm
CN110458296A (en) The labeling method and device of object event, storage medium and electronic device
CN112580902A (en) Object data processing method and device, computer equipment and storage medium
CN115794803A (en) Engineering audit problem monitoring method and system based on big data AI technology
CN104102730B (en) Known label-based big data normal mode extracting method and system
CN112990989B (en) Value prediction model input data generation method, device, equipment and medium
CN112217908B (en) Information pushing method and device based on transfer learning and computer equipment
CN111784360B (en) Anti-fraud prediction method and system based on network link backtracking

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160511