CN105574544A - Data processing method and device - Google Patents
Data processing method and device
- Publication number
- CN105574544A (application CN201510943565.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- random forest
- decision tree
- forest model
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
Abstract
The invention provides a data processing method and device. The method comprises the following steps: training labeled data in a predetermined format to obtain a first random forest model, the first random forest model comprising T decision trees; performing label prediction, that is, classification, on unlabeled data in the predetermined format according to the first random forest model; after the data acquire true labels, extracting K sample sets from the truly labeled data and building K decision trees; combining the first random forest model and the K decision trees into a second random forest model and using it to predict the truly labeled data; and calculating a composite performance index for each decision tree in the second random forest model, and taking the T decision trees with the highest composite performance index as the first random forest model. The method improves the classification accuracy of the data. The invention also provides a corresponding data processing device.
Description
Technical field
The present invention relates to the field of data processing, and in particular to a data processing method and device.
Background technology
In the era of big data, data are widely used in many fields, and accurately distinguishing which data are normal and which are abnormal has become increasingly important. For example, in the field of social-security medical insurance, fraudulent arbitrage of insurance funds is growing in intensity, while traditional arbitrage detection relies on simple rules or offline models. As data volumes increase, the processing capacity of such naive or offline models declines, and many offenders try to work around the rules and models. Prevention and control by the social-security medical insurance system therefore lag behind, and it becomes difficult to judge accurately whether an insurance transaction is a normal transaction or an abnormal one, such as an arbitrage transaction. If the system cannot make this judgment accurately, arbitrage cases will keep occurring and affect the stable operation of the insurance fund. Accurately distinguishing the class of data, for example whether data are normal or abnormal, is therefore a problem that still needs to be solved.
Summary of the invention
The invention provides a data processing method that improves the accuracy of data classification.
In addition, the invention provides a device that uses the above data processing method.
A data processing method comprises:
obtaining unlabeled data in a preset format;
judging whether a first random forest model has been established, the first random forest model comprising T decision trees;
if the first random forest model has been established, performing label prediction on the unlabeled data in the preset format according to the first random forest model, and saving the prediction results;
obtaining data with true labels according to the prediction results;
extracting, with replacement, K sample sets from the data with true labels, where K < T;
establishing K decision trees from the K sample sets;
combining the first random forest model and the K decision trees into a second random forest model, and performing label prediction on the data with true labels using the second random forest model;
calculating, from the prediction results, information about each decision tree in the second random forest model, including a composite performance index for each decision tree; and
deleting the K decision trees with the lowest composite performance index, and taking the T decision trees that were not deleted as the first random forest model.
A data processing device comprises an acquisition module, a judging module, a prediction module, a sample-set extraction module, a decision-tree generation module, a calculation module, and a deletion module.
The acquisition module obtains unlabeled data in a preset format.
The judging module judges whether a first random forest model has been established, the first random forest model comprising T decision trees.
The prediction module, if the first random forest model has been established, performs label prediction on the unlabeled data in the preset format according to the first random forest model, and saves the prediction results.
The acquisition module also obtains data with true labels according to the prediction results.
The sample-set extraction module extracts, with replacement, K sample sets from the data with true labels, where K < T.
The decision-tree generation module establishes K decision trees from the K sample sets.
The prediction module also combines the first random forest model and the K decision trees into a second random forest model, and performs label prediction on the data with true labels using the second random forest model.
The calculation module calculates, from the prediction results, information about each decision tree in the second random forest model, including a composite performance index for each decision tree.
The deletion module deletes the K decision trees with the lowest composite performance index, and takes the T decision trees that were not deleted as the first random forest model.
With the above data processing method and device, the first random forest model comprising T decision trees first performs label prediction, that is, classification, on a large amount of data. The true labels of the data are then determined from the prediction results, K sample sets are drawn from the truly labeled data, and K decision trees are built from them. The T decision trees and the K new decision trees are combined into a second random forest model, which performs label prediction on the truly labeled data; a composite performance index is calculated for each decision tree in the second random forest model from these prediction results, and the T decision trees with the highest index are chosen as the new first random forest model. In other words, after the first random forest model classifies unlabeled data by label prediction, the classified data are used to update the model itself. Because the first random forest model keeps being updated, classifying data with the updated model improves classification accuracy.
Accompanying drawing explanation
To describe the specific solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show merely some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the data processing method.
Fig. 2 is a flowchart of the data preprocessing method.
Fig. 3 is a functional block diagram of the data processing device.
Fig. 4 is a functional block diagram of the acquisition module.
Embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the data processing method, which comprises the following steps.
Step S101: obtain unlabeled data in a preset format. Data in the preset format are data that share the same attributes. There are many such records, including data of normal transactions and data of abnormal transactions. In this embodiment, a label indicates whether a record is a normal transaction or an abnormal one, so the labels are of two kinds: normal transaction and abnormal transaction. Data of normal transactions are white samples, and data of abnormal transactions are black samples. In other implementations there may be more than two kinds of labels. This embodiment is described using social-security medical insurance transaction data as an example. Such data contain records of normal persons using medical insurance to see a doctor and buy medicine, and possibly also records of offenders doing the same. Unlabeled data are transaction records that have not yet been distinguished as white samples (normal transactions) or black samples (abnormal transactions).
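As an illustration only, a record in the preset format might be represented as follows, with an empty label attribute for unlabeled data. The field names and label strings are assumptions made for this sketch, not fixed by the text.

```python
def make_record(name, age, amount, label=None):
    # label is None for unlabeled data, "normal" for a white sample,
    # and "abnormal" for a black sample
    return {"name": name, "age": age, "amount": amount, "label": label}

unlabeled = make_record("A", 42, 130.0)                 # not yet classified
white = make_record("B", 35, 80.0, label="normal")      # normal transaction
black = make_record("C", 29, 990.0, label="abnormal")   # abnormal transaction
print(unlabeled["label"], white["label"], black["label"])  # None normal abnormal
```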
Step S102: judge whether a first random forest model has been established, the first random forest model comprising T decision trees. If it has been established, perform step S105; if not, perform step S103. Here, the T decision trees together form a random forest, which is called the random forest model.
Step S103: obtain labeled data in the preset format. Data in the preset format are data that share the same attributes, and there are many such records. If a random forest model has not yet been established, labeled data in the preset format are obtained. For example, in social-security medical insurance, labeled transaction data are records that have already been distinguished as white samples (normal transactions) or black samples (abnormal transactions).
Step S104: import the labeled data in the preset format into model training to generate the first random forest model, which comprises T decision trees. A random forest model can be used for classification. In machine learning, a random forest classifier is a classifier comprising multiple decision trees, whose output class is determined by the classes output by the individual trees. The basic idea of random forest classification is: draw K sample sets from the original sample set randomly and with replacement, each sample set being the same size as the original set; build one decision-tree model per sample set, obtaining K classification results, with each tree holding one vote; and decide the final class of each sample by voting over the K results. Generating a random forest is the process of training each decision tree, which comprises the following steps: (1) randomly select k samples with replacement and train a decision tree on them; (2) each sample has M attributes; whenever a node of the tree needs to be split, randomly select m attributes from the M attributes (in general m << M), then use some strategy to select the best of these m attributes as the split attribute of the current node; (3) split every node of the tree as in step (2) until no further split is possible. The random forest can be expressed as {h(X, Θ_k), k = 1, ..., K}, where h(X, Θ_k) denotes a decision tree and K is the number of trees in the forest. Θ_k is a sequence of independent, identically distributed random variables, whose randomness comes from two sources: (1) K training sample sets of the same size as the original sample set X are drawn from it with replacement, and each training sample set constructs one corresponding decision tree; (2) when each node of a tree is split, a random subset is drawn from all attributes with equal probability, and an optimal attribute from this subset is selected to split the node. This randomness entering the forest greatly reduces the correlation between the decision trees and improves classification performance.
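The construction above can be sketched as follows. This is a minimal, illustrative version only: each "tree" is reduced to a single mean-threshold split (a stump) standing in for a full decision tree, and the "best attribute" strategy is simply training accuracy; a real implementation would recurse until no further split is possible. The bootstrap sampling, random attribute subset of size m, and majority voting follow the text.

```python
import random

def bootstrap(samples, rng):
    # Draw, with replacement, a sample set the same size as the original set.
    return [rng.choice(samples) for _ in samples]

def majority(labels):
    return max(set(labels), key=labels.count) if labels else 0

def train_stump(samples, n_attrs, m, rng):
    # Randomly pick m of the M attributes; keep the attribute whose
    # mean-threshold split classifies the training samples best.
    best = None
    for a in rng.sample(range(n_attrs), m):
        thr = sum(x[a] for x, _ in samples) / len(samples)
        left = majority([y for x, y in samples if x[a] <= thr])
        right = majority([y for x, y in samples if x[a] > thr])
        correct = sum(1 for x, y in samples
                      if y == (left if x[a] <= thr else right))
        if best is None or correct > best[0]:
            best = (correct, a, thr, left, right)
    _, a, thr, left, right = best
    return lambda x: left if x[a] <= thr else right

def train_forest(samples, n_attrs, n_trees, m, seed=0):
    rng = random.Random(seed)
    return [train_stump(bootstrap(samples, rng), n_attrs, m, rng)
            for _ in range(n_trees)]

def predict(forest, x):
    votes = [tree(x) for tree in forest]     # each tree casts one vote
    return max(set(votes), key=votes.count)  # majority vote decides the class

# Toy data: two attributes, both correlated with the class label.
data = ([((i, 2 * i), 0) for i in range(10)]
        + [((i, 2 * i), 1) for i in range(10, 20)])
forest = train_forest(data, n_attrs=2, n_trees=15, m=1)
print(predict(forest, (-5, -10)), predict(forest, (25, 50)))
```

With m = 1 each stump sees only a random one of the two attributes, yet the majority vote over fifteen stumps still classifies the clearly separated points correctly, which is the point of the ensemble.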
Steps S103-S104 generate the first random forest model. The generated model can perform label prediction on unlabeled data, predicting whether each record is data of a normal transaction or of an abnormal one. Performing label prediction on unlabeled data is classifying data that have not yet been classified.
Step S105: perform label prediction on the unlabeled data in the preset format according to the established first random forest model, and save the prediction results. Label prediction is classification prediction: the unclassified data are classified, and the classification results are saved.
Step S106: obtain data with true labels according to the prediction results. The results of label prediction by the random forest model are not necessarily all correct; some may be wrong, and the labels of wrongly predicted records can be corrected manually, or by other methods, according to the true circumstances of the transactions. If all prediction results are correct, the prediction results are the data with true labels; likewise, if no incorrectly predicted records are detected, the prediction results are also regarded as the data with true labels.
Step S107: extract, with replacement, K sample sets from the data with true labels, where K < T. After the true labels of the data are determined, K sample sets are drawn with replacement from the truly labeled data; preferably, K is between T/20 and T/10.
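Step S107 can be sketched as below. The choice K = T/10 (the upper end of the suggested range) and the decision to draw each sample set the same size as the source set, matching the bootstrap construction of step S104, are assumptions of this sketch.

```python
import random

def draw_sample_sets(records, T, rng=None):
    # K chosen between T/20 and T/10 as the text suggests; upper end here.
    rng = rng or random.Random(0)
    K = max(1, T // 10)
    # Each sample set is drawn, with replacement, the size of the source set.
    return [[rng.choice(records) for _ in records] for _ in range(K)]

labeled = [("tx%d" % i, "normal" if i % 4 else "abnormal") for i in range(40)]
sample_sets = draw_sample_sets(labeled, T=100)
print(len(sample_sets), len(sample_sets[0]))  # 10 40
```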
Step S108: generate K decision trees from the K sample sets. Each decision tree is constructed as described in step S104, which is not repeated here. Every tree is trained, yielding K decision trees.
Step S109: combine the first random forest model and the K decision trees into a second random forest model, perform label prediction on the data with true labels using the second random forest model, and save the prediction results.
Step S110: calculate, from the prediction results, information about each decision tree in the second random forest model, including a composite performance index for each tree. The performance indexes of a decision tree include, but are not limited to, accuracy and coverage. Accuracy is the proportion of records in the total sample set whose predicted label is correct. Coverage is the proportion of all black samples that are predicted as black, or the proportion of all white samples that are predicted as white; in this embodiment it is the proportion of all black samples that are predicted as black. Preferably, composite performance index = a * accuracy + b * coverage, where a + b = 1 and a = b = 1/2. In some embodiments the composite index includes only accuracy, or only coverage. In other embodiments the generation time of a decision tree also affects the composite index as a weight; for example, a tree generated close to the current time is weighted more heavily than a tree generated long before the current time.
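The preferred composite index of step S110 can be written out directly, under the stated assumption a = b = 1/2 and with "black"/"white" as illustrative label values:

```python
def composite_index(y_true, y_pred, a=0.5, b=0.5):
    # accuracy: proportion of all predictions that are correct
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # coverage: proportion of all true black samples predicted as black
    pred_for_blacks = [p for t, p in zip(y_true, y_pred) if t == "black"]
    coverage = (sum(p == "black" for p in pred_for_blacks) / len(pred_for_blacks)
                if pred_for_blacks else 0.0)
    return a * accuracy + b * coverage

y_true = ["black", "black", "white", "white", "black"]
y_pred = ["black", "white", "white", "white", "black"]
print(composite_index(y_true, y_pred))  # 0.5 * 0.8 + 0.5 * (2/3), about 0.733
```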
Step S111: sort the decision trees in the second random forest in descending order of composite performance index to obtain a ranking.
Step S112: delete the last K decision trees in the ranking, and take the T decision trees that were not deleted as the first random forest model. This is the updated first random forest model. Note that the K decision trees mentioned here are not the K trees of steps S108 and S109 obtained from the K sample sets, but the K trees with the lowest composite performance index in the sorted second random forest model. Likewise, the T decision trees mentioned here are not the T trees of steps S102 and S104, but the T trees with the highest composite performance index in the sorted second random forest model, that is, the T trees after updating.
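Steps S111-S112 amount to one sort and one slice. In this sketch a tree is reduced to a (name, index) pair; the names and index values are illustrative.

```python
def update_forest(scored_trees, T):
    # Sort by composite index, descending; keep the top T. The K lowest-
    # scoring trees at the tail of the ranking are thereby deleted.
    ranked = sorted(scored_trees, key=lambda t: t[1], reverse=True)
    return ranked[:T]

# A second forest of T + K = 6 trees with illustrative composite indexes.
second_forest = [("t0", 0.9), ("t1", 0.6), ("t2", 0.95),
                 ("t3", 0.7), ("t4", 0.5), ("t5", 0.8)]
first_forest = update_forest(second_forest, T=4)
print([name for name, _ in first_forest])  # ['t2', 't0', 't5', 't3']
```

Note that which trees survive depends only on their index, not on whether they came from the old forest or from the K newly built trees, exactly as the text requires.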
Steps S107-S112 form the adaptive flow of the model, namely the flow of updating the first random forest model. Updating the first random forest model lets it adapt to new variations in the data, so that the results of label prediction, that is, of classification, are more accurate. Whenever new data are to be processed, the model is updated in the same way.
In other implementations, the acquired data (whether unlabeled or labeled) may not be in the preset format. After acquisition, such data must be preprocessed and converted into data in the preset format. Fig. 2 is a flowchart of this preprocessing method, which comprises the following steps.
Step S201: acquire data. There are many records, including data of normal transactions and data of abnormal transactions. In this embodiment there are two kinds of labels, normal transaction and abnormal transaction; data of normal transactions are white samples, and data of abnormal transactions are black samples. For example, social-security medical insurance data contain records of normal persons using medical insurance to see a doctor and buy medicine, and possibly also records of offenders doing the same. Unlabeled means the transaction data have not been distinguished as normal or abnormal transactions; labeled means they have been. Each acquired record is the transaction data of one visit to see a doctor and buy medicine, and its attributes include, but are not limited to, name, age, sex, time of diagnosis, insurance amount used in the single transaction, cause of disease, medicine used, region, and label. For unlabeled data, the value of the label attribute is empty.
Step S202: collect the attributes of the data and process abnormal records. Abnormal records include records that violate rules, records whose important attribute values are empty, records whose format differs from the others, and so on. Methods for handling abnormal records include, but are not limited to, max/min rule processing, missing-value processing, and format conversion. Specifically, in social-security medical insurance: under a max/min rule, if the "age" attribute of a record is 200, the record violates the rule and is filtered out; under missing-value processing, if the "name" attribute of a record is missing, the record can be filtered out; under format conversion, if the "insurance amount used" of some records is in units of fen while that of other records is in units of jiao, the units are unified to facilitate later processing.
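The three cleaning rules of step S202 can be sketched as one pass over the records. The field names, the age range, and the fen-to-jiao conversion factor (1 jiao = 10 fen) are assumptions of this sketch.

```python
def clean(records):
    out = []
    for r in records:
        if not r.get("name"):             # missing-value rule: drop the record
            continue
        if not (0 < r["age"] < 150):      # max/min rule: drop impossible ages
            continue
        r = dict(r)                       # don't mutate the caller's record
        if r.get("unit") == "fen":        # format conversion: fen -> jiao
            r["amount"] /= 10.0
            r["unit"] = "jiao"
        out.append(r)
    return out

raw = [
    {"name": "A", "age": 200, "amount": 50.0, "unit": "jiao"},   # bad age
    {"name": "",  "age": 40,  "amount": 50.0, "unit": "jiao"},   # missing name
    {"name": "B", "age": 40,  "amount": 500.0, "unit": "fen"},   # needs converting
]
print(clean(raw))  # [{'name': 'B', 'age': 40, 'amount': 50.0, 'unit': 'jiao'}]
```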
Step S203: extract, according to the data attributes, the attribute set useful for data processing, and form data in the preset format from the attribute values corresponding to this attribute set. The preset format refers to data sharing the same attributes. In some embodiments, extraction methods include, but are not limited to, statistically merging the values of the same attribute across many records into one record; for example, in social-security medical insurance, counting the number of visits of the same person within a certain period, summing the total insurance charge used, and so on. The attribute set includes, but is not limited to, name, age, sex, number of visits, total insurance charge used, cause of disease, medicine used, region, and label. For unlabeled data, the value of the label attribute is empty.
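The statistical merging described in step S203 can be sketched as a per-person aggregation. Representing a visit as a (name, amount) pair and keeping only two derived attributes are simplifications of this sketch.

```python
from collections import defaultdict

def aggregate(visits):
    # Merge many per-visit records into one record per person:
    # count the visits and sum the insurance amount used over the period.
    acc = defaultdict(lambda: {"visits": 0, "total": 0.0})
    for name, amount in visits:
        acc[name]["visits"] += 1
        acc[name]["total"] += amount
    return dict(acc)

visits = [("A", 30.0), ("B", 12.5), ("A", 45.0), ("A", 5.0)]
features = aggregate(visits)
print(features["A"])  # {'visits': 3, 'total': 80.0}
```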
Fig. 3 is a functional block diagram of the data processing device. The device comprises an acquisition module 11, a judging module 12, a prediction module 13, a sample-set extraction module 14, a decision-tree generation module 15, a calculation module 16, a sorting module 17, and a deletion module 18.
The acquisition module 11 obtains unlabeled data in the preset format. As in step S101, data in the preset format share the same attributes; the many records include normal transactions (white samples) and abnormal transactions (black samples), a label indicates which kind a record is, and in other implementations there may be more than two kinds of labels. This embodiment again uses social-security medical insurance transaction data as an example: unlabeled data are records not yet distinguished as white samples (normal transactions) or black samples (abnormal transactions).
The judging module 12 judges whether a first random forest model has been established, the first random forest model comprising T decision trees, which together form the random forest model.
The acquisition module 11 also obtains labeled data in the preset format. If a random forest model has not yet been established, labeled data in the preset format are obtained; in social-security medical insurance, these are records already distinguished as white samples (normal transactions) or black samples (abnormal transactions).
The decision-tree generation module 15 imports the labeled data in the preset format into model training to generate the first random forest model comprising T decision trees. A random forest model can be used for classification: in machine learning, a random forest classifier comprises multiple decision trees, and its output class is determined by the classes output by the individual trees.
The prediction module 13, if the first random forest model has been established, performs label prediction, that is, classification, on the unlabeled data in the preset format according to the established model, and saves the prediction results.
The acquisition module 11 also obtains data with true labels according to the prediction results. As in step S106, the predictions of the first random forest model are not necessarily all correct; wrongly predicted labels can be corrected manually or by other methods according to the true circumstances of the transactions, and if no incorrect predictions are detected, the prediction results are regarded as the data with true labels.
The sample-set extraction module 14 extracts, with replacement, K sample sets from the data with true labels, where K < T; preferably, K is between T/20 and T/10.
The decision-tree generation module 15 also establishes K decision trees from the K sample sets.
The prediction module 13 also combines the first random forest model and the K decision trees into a second random forest model, performs label prediction on the data with true labels using the second random forest model, and saves the prediction results.
The calculation module 16 calculates, from the prediction results, information about each decision tree in the second random forest model, including a composite performance index for each tree, as described in step S110: the performance indexes include, but are not limited to, accuracy and coverage; preferably, composite performance index = a * accuracy + b * coverage with a + b = 1 and a = b = 1/2; in some embodiments only accuracy or only coverage is used, and in other embodiments the generation time of a tree also contributes as a weight, newer trees being weighted more heavily than older ones.
The sorting module 17 sorts the decision trees in the second random forest in descending order of composite performance index to obtain a ranking.
The deletion module 18 deletes the last K decision trees in the ranking and takes the T decision trees that were not deleted as the updated first random forest model. As in step S112, the K trees deleted here are the K trees with the lowest composite performance index in the sorted second random forest model, not the K trees built from the K sample sets; and the T trees kept are the T trees with the highest composite performance index after sorting, that is, the T trees after updating.
In other embodiments, the acquired data may not be in the preset format, and the acquisition module 11 also has the function of converting such data into data in the preset format. Specifically, as shown in Fig. 4, the acquisition module 11 comprises an acquiring unit 111 and a preprocessing unit 112.
Acquiring unit 111, for obtaining data.These data have many, and these data have the data of arm's length transaction, also have the data of abnormal exchanges.In the present embodiment, the label of data has two kinds, is arm's length transaction and abnormal exchanges respectively.The data of arm's length transaction are white sample, and the data of abnormal exchanges are black sample.Such as, in social security medical insurance, data have normal person to use social security medical insurance to see a doctor the data of buying medicine, and offender also may be had to use social security medical insurance to see a doctor the transaction data buying medicine.Social security medical insurance transaction data does not distinguish arm's length transaction or abnormal exchanges not have label to refer to, social security medical insurance transaction data has distinguished arm's length transaction or abnormal exchanges to have label to refer to.The transaction data obtained is see a doctor the transaction data buying medicine at every turn, and the attribute of this transaction data includes but not limited to that name, age, sex, the time to diagnose the illness, single use the social security medical insurance amount of money, the cause of disease, use medicine, area, label etc.For the data not having label, the property value of label is empty.
The pretreatment unit 112 is used for collecting the attributes of the data and processing abnormal data. Abnormal data include data that do not meet the rules, data whose important attribute values are empty, data whose formats differ, etc. Processing methods for abnormal data include but are not limited to min/max rule processing, missing-value processing, and format conversion. Specifically, in social security medical insurance: for the min/max rule, if the "age" attribute value in a transaction record is 200, it does not meet the rule and such a sample is filtered out; for missing-value processing, if the "name" attribute value in a transaction record is empty, such a sample can be filtered out; for format conversion, if the "social security medical insurance amount used" of some transaction records is in units of fen while that of other records is in units of jiao, the unit is unified to facilitate subsequent processing of the transaction records.
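The three cleaning steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the record fields, the age bound of 150, and the fen-to-jiao conversion factor of 10 are chosen for the example and are not prescribed by the text.

```python
# Sketch of the pretreatment unit's abnormal-data handling:
# min/max rule filtering, missing-value filtering, and unit conversion.
# Field names and thresholds are illustrative assumptions.

def preprocess(records):
    cleaned = []
    for r in records:
        # min/max rule: drop samples whose age is outside a plausible range
        if not (0 < r["age"] < 150):
            continue
        # missing-value processing: drop samples with an empty name
        if not r["name"]:
            continue
        # format conversion: unify the amount to one unit (jiao);
        # amounts recorded in fen are divided by 10
        if r.get("unit") == "fen":
            r = {**r, "amount": r["amount"] / 10, "unit": "jiao"}
        cleaned.append(r)
    return cleaned

records = [
    {"name": "a", "age": 200, "amount": 30, "unit": "jiao"},   # rule violation
    {"name": "",  "age": 40,  "amount": 50, "unit": "jiao"},   # missing name
    {"name": "b", "age": 35,  "amount": 120, "unit": "fen"},   # needs conversion
]
print(preprocess(records))  # keeps only the third record, converted to jiao
```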
The pretreatment unit 112 is also used for extracting, from the data attributes, an attribute set useful for data processing, and the attribute values corresponding to this attribute set form data in the preset format. The preset format means data having the same attributes. In some embodiments, extraction methods include but are not limited to statistically merging the values of the same attributes across many pieces of data into one piece of data, e.g., in social security medical insurance, counting how many times the same person saw a doctor within a certain period and the total social security medical insurance amount used. The attribute set includes but is not limited to name, age, sex, number of visits, total social security medical insurance amount used, cause of disease, medicine used, area, label, etc. For data without a label, the value of the label attribute is empty.
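The statistical merging described above (visits per person, total amount per person) can be sketched like this. The field names `visit_count` and `total_amount` are illustrative; the text does not fix the attribute names.

```python
# Sketch of the attribute-extraction step: merging several transaction
# records of the same person over a period into one row holding a visit
# count and a social security medical insurance total.
from collections import defaultdict

def aggregate(transactions):
    visits = defaultdict(int)
    totals = defaultdict(float)
    for t in transactions:
        visits[t["name"]] += 1
        totals[t["name"]] += t["amount"]
    # one row per person, all rows sharing the same attributes
    # (the "preset format" described in the text)
    return [
        {"name": n, "visit_count": visits[n], "total_amount": totals[n]}
        for n in visits
    ]

tx = [
    {"name": "zhang", "amount": 120.0},
    {"name": "zhang", "amount": 80.0},
    {"name": "li",    "amount": 45.5},
]
print(aggregate(tx))
```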
With the above data processing method and device, label prediction is performed on data with the first random forest model to determine whether the data are normal or abnormal. After the true labels of the data are determined, the first random forest model is updated with the data having true labels, where the first random forest model comprises T decision trees. To update the model, K sample sets are extracted from the data with true labels and K decision trees are built from them; the first random forest model and the K decision trees together form the second random forest model; label prediction is performed on the data with true labels; the integrated performance index of each decision tree in the second random forest model is calculated from the prediction results; and the T decision trees with the highest integrated performance indexes are chosen as the first random forest model. In other words, after the first random forest model classifies data through label prediction, it is updated with the classified data, so the first random forest model is continuously updated, which improves the accuracy of data classification. The above method and device can be used in the field of social security medical insurance to improve the prevention and control of social security medical insurance fraud. After deployment, the method and device need no manual intervention during the service period of the model, which saves cost.
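The full update cycle summarized above can be sketched end to end. To keep the sketch dependency-free, a "decision tree" is stood in for by a one-feature threshold stump and the integrated performance index is plain accuracy; the real method would train full decision trees and may also use coverage rate. All names and the toy data are illustrative assumptions.

```python
# End-to-end sketch of the update cycle: draw K bootstrap sample sets
# from the labelled data, fit K new trees, pool them with the current
# T trees into the second forest, score every tree, and keep the top T.
import random

def fit_stump(samples):
    # illustrative stand-in for decision-tree training: pick the
    # threshold on x that best separates the labels in this sample set
    best_thr, best_acc = None, -1.0
    for thr in {x for x, _ in samples}:
        acc = sum((x > thr) == y for x, y in samples) / len(samples)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    thr = best_thr
    return lambda x: x > thr

def accuracy(tree, data):
    # stand-in for the integrated performance index of one tree
    return sum(tree(x) == y for x, y in data) / len(data)

def update_forest(forest, labelled, K, T):
    # K sample sets drawn with replacement (K < T), one tree per set
    new_trees = []
    for _ in range(K):
        sample = [random.choice(labelled) for _ in labelled]
        new_trees.append(fit_stump(sample))
    second_forest = forest + new_trees            # T + K trees
    ranked = sorted(second_forest,
                    key=lambda t: accuracy(t, labelled), reverse=True)
    return ranked[:T]                             # updated first model

random.seed(0)
labelled = [(x, x > 5) for x in range(11)]        # toy labelled data
forest = [fit_stump(labelled) for _ in range(3)]  # initial T=3 trees
forest = update_forest(forest, labelled, K=2, T=3)
print(len(forest))  # → 3
```

Because every update re-ranks old and new trees together, the forest size stays fixed at T while its members track the most recent labelled data, which is the continuous-update property the text emphasizes.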
The foregoing are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (10)
1. A data processing method, characterized in that the method comprises the following steps:
obtaining unlabelled data in a preset format;
judging whether a first random forest model has been established, the first random forest model comprising T decision trees;
if the first random forest model has been established, performing label prediction on the unlabelled data in the preset format with the first random forest model, and saving the label prediction results;
obtaining data with true labels according to the label prediction results;
extracting, with replacement, K sample sets from the data with true labels, where K<T;
establishing K decision trees according to the K sample sets;
forming a second random forest model from the first random forest model and the K decision trees, and performing label prediction on the data with true labels with the second random forest model;
calculating information of each decision tree in the second random forest model respectively according to the label prediction results, wherein the information of each decision tree comprises an integrated performance index of the decision tree; and
deleting the K decision trees with the lowest integrated performance indexes, and taking the T decision trees that were not deleted as the first random forest model.
2. the method for claim 1, is characterized in that: the integrated performance index of each decision tree comprises accuracy rate and/or coverage rate.
3. the method for claim 1, it is characterized in that: calculate each decision tree information in described second Random Forest model respectively according to the result of described Tag Estimation, wherein, after described each decision tree information comprises the step of the integrated performance index of each decision tree, described method also comprises:
According to described integrated performance index, sequence is carried out to each decision tree in described second Random Forest model and obtain ranking results;
Delete K the decision tree that described in described ranking results, integrated performance index is minimum, using a not deleted T decision tree as the first Random Forest model.
4. the method for claim 1, it is characterized in that: judge whether to set up the first Random Forest model, after described first Random Forest model comprises the step of T decision tree, described method also comprises: if do not set up the first Random Forest model, obtains the data having label of preset format; Generate the first Random Forest model to the data of the label execution random forests algorithm that has of described preset format, described first Random Forest model comprises T decision tree.
5. The method of claim 4, characterized in that the step of obtaining unlabelled data in the preset format or obtaining labelled data in the preset format comprises the following steps:
obtaining data;
collecting the attributes of the data, and processing abnormal data; and
extracting an attribute set useful for data processing according to the data attributes, and forming data in the preset format from the attribute values corresponding to the attribute set.
6. A data processing device, characterized in that the device comprises an acquisition module, a judging module, a prediction module, a sample set extraction module, a decision tree generation module, a computing module, and a removing module;
the acquisition module is used for obtaining unlabelled data in a preset format;
the judging module is used for judging whether a first random forest model has been established, the first random forest model comprising T decision trees;
the prediction module is used for, if the first random forest model has been established, performing label prediction on the unlabelled data in the preset format with the first random forest model, and saving the label prediction results;
the acquisition module is further used for obtaining data with true labels according to the label prediction results;
the sample set extraction module is used for extracting, with replacement, K sample sets from the data with true labels, where K<T;
the decision tree generation module is used for establishing K decision trees according to the K sample sets;
the prediction module is further used for forming a second random forest model from the first random forest model and the K decision trees, and performing label prediction on the data with true labels with the second random forest model;
the computing module is used for calculating information of each decision tree in the second random forest model respectively according to the label prediction results, the information comprising an integrated performance index of each decision tree; and
the removing module is used for deleting the K decision trees with the lowest integrated performance indexes, and taking the T decision trees that were not deleted as the first random forest model.
7. The device of claim 6, characterized in that the integrated performance index of a decision tree comprises an accuracy rate and/or a coverage rate.
8. The device of claim 6, characterized in that the device further comprises a sorting module, the sorting module being used for sorting the decision trees in the second random forest model according to the integrated performance indexes to obtain a ranking result; and the removing module is further used for deleting the K decision trees with the lowest integrated performance indexes in the ranking result, and taking the T decision trees that were not deleted as the first random forest model.
9. The device of claim 6, characterized in that:
the acquisition module is further used for, if the first random forest model has not been established, obtaining labelled data in the preset format; and
the decision tree generation module is further used for executing a random forest algorithm on the labelled data in the preset format to generate the first random forest model, the first random forest model comprising T decision trees.
10. The device of claim 9, characterized in that the acquisition module comprises an acquiring unit and a pretreatment unit,
the acquiring unit being used for obtaining data; and
the pretreatment unit being used for collecting the attributes of the data and processing abnormal data, the pretreatment unit being further used for extracting an attribute set useful for data processing according to the data attributes, and forming data in the preset format from the attribute values corresponding to the attribute set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510943565.8A CN105574544A (en) | 2015-12-16 | 2015-12-16 | Data processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105574544A true CN105574544A (en) | 2016-05-11 |
Family
ID=55884650
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510943565.8A Pending CN105574544A (en) | 2015-12-16 | 2015-12-16 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105574544A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101251851A (en) * | 2008-02-29 | 2008-08-27 | 吉林大学 | Multi-classifier integrating method based on increment native Bayes network |
CN104391970A (en) * | 2014-12-04 | 2015-03-04 | 深圳先进技术研究院 | Attribute subspace weighted random forest data processing method |
CN104636814A (en) * | 2013-11-14 | 2015-05-20 | 中国科学院深圳先进技术研究院 | Method and system for optimizing random forest models |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106156315A (en) * | 2016-07-01 | 2016-11-23 | 中国人民解放军装备学院 | A kind of data quality monitoring method judged based on disaggregated model |
CN106156315B (en) * | 2016-07-01 | 2019-05-17 | 中国人民解放军装备学院 | A kind of data quality monitoring method based on disaggregated model judgement |
CN109446017A (en) * | 2018-09-03 | 2019-03-08 | 平安科技(深圳)有限公司 | A kind of alarm algorithm generation method, monitoring system and terminal device |
CN109544150A (en) * | 2018-10-09 | 2019-03-29 | 阿里巴巴集团控股有限公司 | A kind of method of generating classification model and device calculate equipment and storage medium |
CN109615538A (en) * | 2018-12-13 | 2019-04-12 | 平安医疗健康管理股份有限公司 | Social security violation detection method, device, equipment and computer storage medium |
CN109919197B (en) * | 2019-02-13 | 2023-07-21 | 创新先进技术有限公司 | Random forest model training method and device |
CN113038302A (en) * | 2019-12-25 | 2021-06-25 | 中国电信股份有限公司 | Flow prediction method and device and computer storage medium |
CN113038302B (en) * | 2019-12-25 | 2022-09-30 | 中国电信股份有限公司 | Flow prediction method and device and computer storage medium |
CN111242793B (en) * | 2020-01-16 | 2024-02-06 | 上海金仕达卫宁软件科技有限公司 | Medical insurance data abnormality detection method and device |
CN111242793A (en) * | 2020-01-16 | 2020-06-05 | 上海金仕达卫宁软件科技有限公司 | Method and device for detecting medical insurance data abnormity |
CN111291896A (en) * | 2020-02-03 | 2020-06-16 | 深圳前海微众银行股份有限公司 | Interactive random forest subtree screening method, device, equipment and readable medium |
CN111291896B (en) * | 2020-02-03 | 2022-02-01 | 深圳前海微众银行股份有限公司 | Interactive random forest subtree screening method, device, equipment and readable medium |
CN111967229A (en) * | 2020-09-01 | 2020-11-20 | 申建常 | Efficient label type data analysis method and analysis system |
CN112651439A (en) * | 2020-12-25 | 2021-04-13 | 平安科技(深圳)有限公司 | Material classification method and device, computer equipment and storage medium |
CN112651439B (en) * | 2020-12-25 | 2023-12-22 | 平安科技(深圳)有限公司 | Material classification method, device, computer equipment and storage medium |
CN113762394A (en) * | 2021-09-09 | 2021-12-07 | 昆明理工大学 | Blasting block size prediction method |
CN113762394B (en) * | 2021-09-09 | 2024-04-26 | 昆明理工大学 | Blasting block prediction method |
CN114913372B (en) * | 2022-05-10 | 2023-05-26 | 电子科技大学 | Target recognition method based on multi-mode data integration decision |
CN114913372A (en) * | 2022-05-10 | 2022-08-16 | 电子科技大学 | Target recognition algorithm based on multi-mode data integration decision |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105574544A (en) | Data processing method and device | |
CN110223168B (en) | Label propagation anti-fraud detection method and system based on enterprise relationship map | |
CN103793484B (en) | The fraud identifying system based on machine learning in classification information website | |
CN108629413A (en) | Neural network model training, trading activity Risk Identification Method and device | |
CN104050556B (en) | The feature selection approach and its detection method of a kind of spam | |
CN103092975A (en) | Detection and filter method of network community garbage information based on topic consensus coverage rate | |
CN108021651A (en) | Network public opinion risk assessment method and device | |
CN109818961A (en) | A kind of network inbreak detection method, device and equipment | |
CN110909195A (en) | Picture labeling method and device based on block chain, storage medium and server | |
CN107368856A (en) | Clustering method and device, the computer installation and readable storage medium storing program for executing of Malware | |
CN111538741B (en) | Deep learning analysis method and system for big data of alarm condition | |
CN110310114A (en) | Object classification method, device, server and storage medium | |
CN115577701B (en) | Risk behavior identification method, device, equipment and medium aiming at big data security | |
CN107003992A (en) | Perception associative memory for neural language performance identifying system | |
CN112329816A (en) | Data classification method and device, electronic equipment and readable storage medium | |
CN107679135A (en) | The topic detection of network-oriented text big data and tracking, device | |
CN104504334B (en) | System and method for assessing classifying rules selectivity | |
CN109191210A (en) | A kind of broadband target user's recognition methods based on Adaboost algorithm | |
CN110458296A (en) | The labeling method and device of object event, storage medium and electronic device | |
CN112580902A (en) | Object data processing method and device, computer equipment and storage medium | |
CN115794803A (en) | Engineering audit problem monitoring method and system based on big data AI technology | |
CN104102730B (en) | Known label-based big data normal mode extracting method and system | |
CN112990989B (en) | Value prediction model input data generation method, device, equipment and medium | |
CN112217908B (en) | Information pushing method and device based on transfer learning and computer equipment | |
CN111784360B (en) | Anti-fraud prediction method and system based on network link backtracking |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20160511 |