CN109408583B - Data processing method and device, computer readable storage medium and electronic equipment - Google Patents



Publication number
CN109408583B
Authority
CN
China
Prior art keywords: data, sub, binning, univariate, target
Prior art date
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number
CN201811117037.7A
Other languages
Chinese (zh)
Other versions
CN109408583A (en)
Inventor
郭继昌 (Guo Jichang)
Current Assignee (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811117037.7A priority Critical patent/CN109408583B/en
Publication of CN109408583A publication Critical patent/CN109408583A/en
Application granted granted Critical
Publication of CN109408583B publication Critical patent/CN109408583B/en


Abstract

The present disclosure belongs to the technical field of big data, and relates to a data processing method and apparatus, a computer-readable storage medium, and an electronic device. The data processing method includes: obtaining a plurality of sample data, each sample data including sub-sample data of one or more dimensions; dividing the sub-sample data of each dimension into multiple groups of bins, and forming multiple univariate binning decision trees from the bins; obtaining the target binning corresponding to each dimension from the univariate binning decision trees; and inputting the target binning into a prediction model so as to perform machine training on the prediction model. On the one hand, the method can eliminate data noise and improve the stability of the model; on the other hand, the binning method is simple and does not require data mining personnel to have rich business background knowledge; moreover, binning the data reduces the large number of distinct values and improves the speed of the algorithm.

Description

Data processing method and device, computer readable storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data processing method, a data processing apparatus, a computer-readable storage medium, and an electronic device.
Background
With economic development and social progress, intelligent terminal devices such as computers and smart phones are ever more widely used, and the data they produce needs to be mined and analyzed to obtain valuable information.
Measured data may contain numerical noise such as random errors, outliers and extreme values, which affects the accuracy of a model. In addition, measured data may contain a large number of distinct (non-repeated) values, which slows down the algorithm if used directly, and some algorithms do not support continuous variables; the data therefore needs to be preprocessed. Binning discretizes the data, eliminating numerical noise and reducing the number of distinct values. However, the commonly used binning methods are mainly equal-frequency and equal-distance binning, which are limited in means: the frequency and the distance are not easy to determine, and data mining personnel need sufficient business background knowledge of the data; otherwise, effective binning cannot be performed and the accuracy of the model is low.
Therefore, a new data processing method and apparatus are needed in the art.
It is noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure and therefore may include information that does not constitute prior art that is already known to a person of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a data processing method, a data processing apparatus, a computer-readable storage medium, and an electronic device, so as to overcome, at least to a certain extent, the influence of numerical noise on model stability, and to avoid the problem that data mining personnel cannot effectively discretize data in the absence of business background knowledge, thereby improving the flexibility and calculation speed of the model.
According to an aspect of the present disclosure, there is provided a data processing method, including:
obtaining a plurality of sample data, wherein each sample data comprises sub-sample data of one or more dimensions;
dividing the sub-sample data of each dimension into a plurality of groups of bins respectively, and forming a plurality of univariate binning decision trees according to the bins;
obtaining the target binning corresponding to the dimension according to the plurality of univariate binning decision trees;
inputting the target bins to a predictive model for machine training of the predictive model.
In an exemplary embodiment of the present disclosure, dividing the sub-sample data of each dimension into multiple groups of bins comprises:
dividing the sub-sample data into multiple groups of bins according to different frequencies; or
dividing the sub-sample data into multiple groups of bins according to a preset number of nodes.
In an exemplary embodiment of the present disclosure, each of the sample data includes target data, and forming a plurality of univariate binning decision trees according to the bins comprises:
forming the univariate binning decision tree by taking the sub-sample data as the root node, the bins as non-leaf nodes, and the target data as leaf nodes.
In an exemplary embodiment of the present disclosure, obtaining a target bin corresponding to the dimension from a plurality of the univariate binning decision trees comprises:
calculating the sub-information value of each leaf node in each univariate binning decision tree;
calculating the information value of each univariate binning decision tree according to the sub-information values;
comparing the information values of the univariate binning decision trees, and taking the binning corresponding to the univariate binning decision tree with the minimum information value as the target binning.
In an exemplary embodiment of the present disclosure, calculating the information value of each univariate binning decision tree from the sub-information values comprises:
adding the sub-information values of the leaf nodes in the univariate binning decision tree to obtain the information value.
In an exemplary embodiment of the present disclosure, each of the sample data further includes target data, and the target binning is input to a prediction model to machine train the prediction model, including:
and inputting the target box as an input vector and the target data as an output vector to the prediction model so as to perform machine training on the prediction model.
In an exemplary embodiment of the present disclosure, the method further comprises:
acquiring data to be analyzed, wherein the data to be analyzed has the same dimensions as the sample data;
and inputting the data to be analyzed into the prediction model to obtain a prediction result.
According to an aspect of the present disclosure, there is provided a data processing apparatus, comprising:
a data acquisition module, configured to obtain a plurality of sample data, wherein each sample data comprises sub-sample data of one or more dimensions;
a decision tree forming module, configured to divide the sub-sample data of each dimension into a plurality of groups of bins respectively and form a plurality of univariate binning decision trees according to the bins;
a target binning obtaining module, configured to obtain the target binning corresponding to each dimension according to the univariate binning decision trees;
a model training module, configured to input the target binning into a prediction model so as to perform machine training on the prediction model.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method as described in any one of the above.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any one of the data processing methods described above via execution of the executable instructions.
The sub-sample data of each dimension is divided into a plurality of groups of bins, univariate binning decision trees are formed from the bins, and the information values of the leaf nodes in each univariate binning decision tree are calculated and summed to obtain the information value corresponding to each group of bins; the information values of the groups are then compared, and the group with the minimum information value is taken as the target binning. After the target binning of every dimension is obtained, the target binning and the target data are input into a prediction model so as to perform machine training on the prediction model; once training is finished, inputting the data to be analyzed into the prediction model yields the prediction result. According to the data processing method, on the one hand, data noise can be eliminated and the stability of the model improved; on the other hand, the binning method is simple and does not require data mining personnel to have rich business background knowledge; moreover, binning the data reduces the large number of distinct values and improves the speed of the algorithm.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 schematically illustrates a flow chart of a method of data processing;
FIG. 2 schematically illustrates an exemplary diagram of an application scenario of a data processing method;
FIGS. 3A-3C schematically illustrate the structure of univariate binning decision trees;
FIG. 4 schematically shows a block diagram of a data processing apparatus;
FIG. 5 schematically illustrates an example block diagram of an electronic device for implementing a data processing method;
fig. 6 schematically shows a computer-readable storage medium for implementing the data processing method.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the related art, when data mining personnel mine data, the data may contain numerical noise such as random errors, outliers and extreme values, which affects the accuracy of the model; for example, extreme values may drive model parameters too high or too low, or cause the model to be "misled" by a false phenomenon and learn a relationship that does not actually exist as an important pattern. To eliminate such numerical noise, the data is discretized by equal-frequency or equal-distance binning; however, these binning methods are limited in means, the frequency and the distance are not easy to determine, and sufficient business background knowledge of the data is required. As a result, the stability of models in the related art is poor, and the large number of distinct values in the data reduces the calculation speed of the model.
In view of the problems in the related art in the field, the present exemplary embodiment first provides a data processing method, where the data processing method may be executed on a server, or may also be executed on a server cluster or a cloud server, and of course, a person skilled in the art may also execute the method of the present disclosure on other platforms as needed, and this is not limited in the present exemplary embodiment. Referring to fig. 1, the data processing method may include the steps of:
S110, obtaining a plurality of sample data, wherein each sample data comprises sub-sample data of one or more dimensions;
S120, dividing the sub-sample data of each dimension into multiple groups of bins respectively, and forming multiple univariate binning decision trees according to the bins;
S130, obtaining the target binning corresponding to each dimension according to the univariate binning decision trees;
and S140, inputting the target boxes into a prediction model so as to perform machine training on the prediction model.
In the data processing method, the sub-sample data of each dimension is divided into multiple groups of bins, and multiple univariate binning decision trees are formed from the bins; the target binning corresponding to each dimension is then obtained from the univariate binning decision trees; finally, the target binning is input into the prediction model for training. On the one hand, the data processing method in the present disclosure can eliminate data noise and improve the stability of the model; on the other hand, the binning method is simple and does not require data mining personnel to have rich business background knowledge; moreover, binning the data reduces the large number of distinct values and improves the speed of the algorithm.
Next, each step in the above-described data processing method in the present exemplary embodiment will be explained and explained in detail with reference to fig. 2.
In step S110, a plurality of sample data is obtained, each sample data including sub-sample data of one or more dimensions.
In an exemplary embodiment of the present disclosure, a plurality of sample data may be obtained from a data warehouse of the server 201 or the terminal device 202. Specifically, the sample data includes behavior data and attribute data. For example, when an insurance company evaluates a business person to determine the likelihood that the business person becomes a supervisor, the behavior data may be features of the business person such as monthly income, monthly sales volume, quarterly sales volume and attendance, and the attribute data may be features such as age or skill level.
In an exemplary embodiment of the present disclosure, the sample data may include sub-sample data of a single dimension or of multiple dimensions. For example, the likelihood of a business person becoming a supervisor may be predicted from age alone, or from age, quarterly sales volume, monthly income and data of other dimensions. Of course, for a more comprehensive evaluation and a more accurate prediction result, multi-dimensional data is preferably used.
In step S120, the sub-sample data of each dimension is divided into multiple groups of bins, and multiple univariate binning decision trees are formed from the bins.
In an exemplary embodiment of the present disclosure, the sub-sample data of each dimension may be divided into multiple groups of bins. During binning, the sub-sample data may be divided into multiple groups of bins according to different frequencies, where the frequency refers to the span of the data interval covered by each bin in a group, so that different frequencies produce groups whose bins cover intervals of different sizes; the sub-sample data may also be divided into multiple groups of bins according to a preset number of nodes; of course, other binning rules may be adopted, which is not specifically limited by the present disclosure. For example, if the business persons are 20 to 60 years old, the frequency may be set to 8, 10 and 20, dividing the age data into three groups of bins. When the frequency is 8, the bins are [20,28), [28,36), [36,44), [44,52), [52,60]; when the frequency is 10, the bins are [20,30), [30,40), [40,50), [50,60]; when the frequency is 20, the bins are [20,40), [40,60]. Alternatively, 3 nodes may be set and the age data grouped randomly to form multiple groups of bins. For example, setting 3 nodes forms three groups, each group containing 4 bins: the first group is [20,30), [30,40), [40,50), [50,60]; the second group is [20,35), [35,45), [45,55), [55,60]; and the third group is [20,25), [25,45), [45,55), [55,60].
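The frequency-based grouping above can be sketched as follows. This is an illustrative sketch, not code from the patent; the function name and representation are assumptions, and "frequency" is taken to be the interval span, as in the example.

```python
# Build candidate bin groups for ages in [20, 60], one group per frequency
# (interval span). Bins are half-open [a, b); the last bin closes at 60.
def equal_width_bins(lo, hi, width):
    """Return intervals of the given width covering [lo, hi]."""
    edges = list(range(lo, hi, width)) + [hi]
    return [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]

candidate_groups = {w: equal_width_bins(20, 60, w) for w in (8, 10, 20)}
# frequency 8  -> [20,28), [28,36), [36,44), [44,52), [52,60]
# frequency 10 -> [20,30), [30,40), [40,50), [50,60]
# frequency 20 -> [20,40), [40,60]
```

The node-based alternative would instead place a preset number of cut points (3 nodes giving 4 bins per group) at arbitrary positions, as in the three groups listed above.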
In an exemplary embodiment of the present disclosure, the sample data further includes target data; for example, whether the business person is a supervisor may serve as the target data. Table 1 shows sample data containing one-dimensional sub-sample data:
No.   Age   Supervisor (Y/N)
1     20    N
2     28    Y
3     30    N
4     35    Y
5     37    Y
6     42    Y
7     45    N
8     50    Y
9     55    N
10    60    N

TABLE 1
Table 1 shows the sub-sample data of the age dimension and the target data corresponding to each sample. In the present disclosure, a univariate binning decision tree is formed with the sub-sample data as the root node, the bins as non-leaf nodes, and the target data as leaf nodes. FIGS. 3A to 3C show the structures of the univariate binning decision trees formed by the first, second and third groups of bins, respectively.
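The tree formation can be sketched with a minimal in-memory representation (an assumption for illustration, not the patent's own structure): the root holds the dimension's sub-sample data, each non-leaf node is one bin, and each leaf collects the target labels of the samples falling into that bin.

```python
# Samples from Table 1 as (age, supervisor) pairs.
SAMPLES = [(20, 'N'), (28, 'Y'), (30, 'N'), (35, 'Y'), (37, 'Y'),
           (42, 'Y'), (45, 'N'), (50, 'Y'), (55, 'N'), (60, 'N')]

def build_tree(samples, bins):
    """bins: list of (lo, hi) half-open intervals; the last one is closed."""
    leaves = {}
    for i, (lo, hi) in enumerate(bins):
        last = i == len(bins) - 1
        leaves[(lo, hi)] = [y for x, y in samples
                            if lo <= x < hi or (last and x == hi)]
    return {'root': samples, 'bins': leaves}

# First group of bins (frequency 10) -> the tree of FIG. 3A.
tree = build_tree(SAMPLES, [(20, 30), (30, 40), (40, 50), (50, 60)])
# leaf [20,30) holds ['N', 'Y']; leaf [50,60] holds ['Y', 'N', 'N']
```

Building one such tree per group of bins yields the three univariate binning decision trees compared in step S130.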
In step S130, a target bin corresponding to the dimension is obtained according to a plurality of the univariate binning decision trees.
In an exemplary embodiment of the present disclosure, after the univariate binning decision trees are formed, the sub-information value of each leaf node may be calculated; the information value of each group of bins is then obtained from the sub-information values of its leaf nodes; finally, the information values of the univariate binning decision trees are compared, and the binning corresponding to the decision tree with the smallest information value is taken as the target binning.
In an exemplary embodiment of the present disclosure, the sub-information value of a leaf node is calculated as:

$$\mathrm{IV}_{\text{leaf}} = -\frac{m}{m+n}\log_{10}\frac{m}{m+n} - \frac{n}{m+n}\log_{10}\frac{n}{m+n} \tag{1}$$

where m is the number of business persons in the leaf node whose target data is "supervisor", and n is the number whose target data is "not supervisor".
Taking the univariate binning decision trees shown in FIGS. 3A-3C as an example, formula (1) gives sub-information values of 0.3010, 0.2764, 0.3010 and 0.2764 for the leaf nodes of the first group of bins; 0.2764, 0, 0.3010 and 0 for the second group; and 0, 0.2173, 0.3010 and 0 for the third group. Adding the leaf-node sub-information values within each group yields the information values of the three groups: 1.1548, 0.5774 and 0.5183, respectively. Comparison shows that the third group has the minimum information value, so the target binning is the third group of bins.
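The arithmetic above can be checked with a short sketch (illustrative, not part of the patent): taking the sub-information value of a leaf to be the base-10 entropy of its Y/N split per formula (1) reproduces the quoted leaf values 0.3010, 0.2764, 0.2173, and so on.

```python
import math

def leaf_iv(m, n):
    """Sub-information value of a leaf with m supervisors and n
    non-supervisors: base-10 entropy of the Y/N split (formula (1))."""
    total = m + n
    iv = 0.0
    for k in (m, n):
        if k:
            p = k / total
            iv -= p * math.log10(p)
    return iv

# Leaf counts (supervisors, non-supervisors) for the three bin groups.
groups = {
    'first':  [(1, 1), (2, 1), (1, 1), (1, 2)],
    'second': [(1, 2), (3, 0), (1, 1), (0, 2)],
    'third':  [(0, 1), (4, 1), (1, 1), (0, 2)],
}
totals = {name: sum(leaf_iv(m, n) for m, n in leaves)
          for name, leaves in groups.items()}
# Full-precision sums are about 1.1549, 0.5775 and 0.5184; the description's
# 1.1548, 0.5774 and 0.5183 sum the rounded leaf values. Either way the
# third group is minimal, so it is selected as the target binning.
```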
In step S140, the target bins are input to a predictive model for machine training of the predictive model.
In an exemplary embodiment of the present disclosure, the target binning may be input into the prediction model as the input vector and the target data as the output vector, so as to machine train the prediction model. The prediction model may be a neural network model or a decision tree model, which is not specifically limited by the present disclosure. After a target binning is obtained for each dimension of the multi-dimensional sub-sample data, the target binnings and the corresponding target data may be input into the prediction model for training.
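The role of step S140 can be sketched as follows. This is a hypothetical stand-in, not the patent's model: a real prediction model (neural network or decision tree, as noted above) would consume the same binned inputs; here a trivial "model" merely learns the majority supervisor label per bin of the target binning.

```python
from collections import Counter

# The third group of bins selected as the target binning above.
TARGET_BINS = [(20, 25), (25, 45), (45, 55), (55, 60)]

def bin_index(x, bins=TARGET_BINS):
    """Map a raw age to the index of its bin (last bin is closed)."""
    for i, (lo, hi) in enumerate(bins):
        if lo <= x < hi or (i == len(bins) - 1 and x == hi):
            return i
    raise ValueError(f"{x} outside all bins")

def train(samples):
    """Learn, per bin, the majority target label (input vector: bin index;
    output vector: supervisor Y/N)."""
    by_bin = {}
    for x, y in samples:
        by_bin.setdefault(bin_index(x), []).append(y)
    return {i: Counter(ys).most_common(1)[0][0] for i, ys in by_bin.items()}

model = train([(20, 'N'), (28, 'Y'), (30, 'N'), (35, 'Y'), (37, 'Y'),
               (42, 'Y'), (45, 'N'), (50, 'Y'), (55, 'N'), (60, 'N')])

def predict(age):
    return model[bin_index(age)]
# e.g. predict(33) falls in [25,45), whose samples are mostly supervisors
```

Because the model sees only bin indices, any age inside one bin yields the same prediction, which is precisely how binning suppresses numerical noise in the raw values.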
On the one hand, the data processing method can eliminate data noise and improve the stability of the model; on the other hand, the binning method is simple and does not require data mining personnel to have rich business background knowledge; moreover, binning the data reduces the large number of distinct values and improves the speed of the algorithm.
In an exemplary embodiment of the present disclosure, after training of the model is completed, data to be analyzed may be obtained and input into the prediction model to obtain the prediction result output by the prediction model. The data to be analyzed has the same dimensions as the sample data; for example, it may be the age of a business person, or the age, quarterly sales volume and skill level of a business person, and inputting these into the prediction model yields the likelihood of the person becoming a supervisor.
The present disclosure also provides a data processing apparatus. Fig. 4 shows a schematic structural diagram of a data processing apparatus, which may include a data acquisition module 410, a decision tree formation module 420, a target bin acquisition module 430, and a model training module 440, as shown in fig. 4. Wherein:
a data acquisition module 410, configured to obtain a plurality of sample data, wherein each sample data comprises sub-sample data of one or more dimensions;
a decision tree forming module 420, configured to divide the sub-sample data of each dimension into multiple groups of bins and form multiple univariate binning decision trees according to the bins;
a target binning obtaining module 430, configured to obtain the target binning corresponding to each dimension according to the univariate binning decision trees;
and the model training module 440 is configured to input the target bins to a prediction model so as to perform machine training on the prediction model.
The specific details of each module in the data processing apparatus have been described in detail in the corresponding data processing method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
An electronic device 500 according to this embodiment of the disclosure is described below with reference to fig. 5. The electronic device 500 shown in fig. 5 is only an example and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of the electronic device 500 may include, but are not limited to: the at least one processing unit 510, the at least one memory unit 520, and a bus 530 that couples various system components including the memory unit 520 and the processing unit 510.
Wherein the storage unit stores program code that is executable by the processing unit 510 to cause the processing unit 510 to perform steps according to various exemplary embodiments of the present disclosure as described in the above section "exemplary methods" of this specification. For example, the processing unit 510 may execute step S110 as shown in fig. 1: obtaining a plurality of sample data, wherein each sample data comprises one or more dimensionality sub-sample data; step S120: dividing the dimensional sub-sample data into a plurality of groups of sub-boxes respectively, and forming a plurality of univariate sub-box decision trees according to the sub-boxes; step S130: obtaining a target sub-box corresponding to the dimension according to the plurality of univariate sub-box decision trees; step S140: inputting the target bins to a predictive model for machine training of the predictive model.
The memory unit 520 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read-only memory unit (ROM) 5203.
Storage unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 500, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 500 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 550. Also, the electronic device 500 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 560. As shown, the network adapter 560 communicates with the other modules of the electronic device 500 over a bus 530. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 6, a program product 600 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing devices may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to external computing devices (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (7)

1. A method of data processing, comprising:
obtaining a plurality of sample data, wherein each sample data comprises sub-sample data of one or more dimensions;
dividing the sub-sample data of each dimension into multiple groups of bins, and forming a plurality of univariate binning decision trees according to the bins, comprising: forming each univariate binning decision tree by taking the sub-sample data as a root node, the bins as non-leaf nodes, and target data as leaf nodes; wherein each of the sample data comprises target data;
obtaining a target binning corresponding to each dimension according to the plurality of univariate binning decision trees, comprising: calculating the sub-information value of each leaf node in each univariate binning decision tree; calculating the information value of each univariate binning decision tree according to the sub-information values, comprising: adding the sub-information values of the leaf nodes in each univariate binning decision tree to obtain the information value; and comparing the information values of the univariate binning decision trees, and taking the binning corresponding to the univariate binning decision tree with the minimum information value as the target binning; and
and inputting the target bins into a prediction model so as to perform machine training on the prediction model.
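Outside the claim language, the selection step of claim 1 can be sketched in Python. The patent does not define how a leaf's sub-information value is computed; the weight-of-evidence form common in binning work is assumed here, and all function names are illustrative rather than part of the claimed method:

```python
import math

def leaf_sub_iv(goods, bads, total_goods, total_bads):
    """Sub-information value of one leaf (bin), assuming the usual WOE form."""
    g, b = goods / total_goods, bads / total_bads
    if g == 0 or b == 0:  # a degenerate leaf contributes nothing
        return 0.0
    return (g - b) * math.log(g / b)

def tree_iv(leaves, total_goods, total_bads):
    """Information value of one univariate binning decision tree:
    the sum of the sub-information values of its leaves."""
    return sum(leaf_sub_iv(g, b, total_goods, total_bads) for g, b in leaves)

def target_binning(candidate_trees, total_goods, total_bads):
    """Compare the trees' information values and keep the binning whose
    tree has the minimum information value, as claim 1 recites."""
    return min(candidate_trees,
               key=lambda leaves: tree_iv(leaves, total_goods, total_bads))
```

Each candidate tree is represented here only by its leaves, as `(goods, bads)` counts per bin; the tree structure itself (root, non-leaf bins) is omitted for brevity.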
2. The data processing method of claim 1, wherein dividing the sub-sample data of each dimension into multiple groups of bins comprises:
dividing the sub-sample data into multiple groups of bins according to different frequencies; or
dividing the sub-sample data into multiple groups of bins according to a preset number of nodes.
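A minimal sketch of the two strategies in claim 2, frequency-based cut points and cut points from a preset number of nodes. The function names, and the even spacing chosen for the preset nodes, are assumptions for illustration, not claim language:

```python
def equal_frequency_cuts(values, n_bins):
    """Cut points placed at equal-count positions in the sorted data,
    so each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    step = len(ordered) / n_bins
    return [ordered[int(step * i)] for i in range(1, n_bins)]

def preset_node_cuts(values, n_nodes):
    """Cut points from a preset number of nodes, here spread evenly
    over the observed value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / (n_nodes + 1)
    return [lo + width * (i + 1) for i in range(n_nodes)]
```

Either routine returns the bin boundaries for one dimension's sub-sample data; the resulting bins then become the non-leaf nodes of that dimension's univariate binning decision tree.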
3. The data processing method of claim 1, wherein each of the sample data further comprises target data, and wherein inputting the target bins into a prediction model so as to perform machine training on the prediction model comprises:
inputting the target bins as an input vector and the target data as an output vector into the prediction model, so as to perform machine training on the prediction model.
4. The data processing method of claim 1, wherein the method further comprises:
acquiring data to be analyzed, wherein the data to be analyzed has data of the same dimensions as the sample data;
and inputting the data to be analyzed into the prediction model to obtain a prediction result.
5. A data processing apparatus, characterized by comprising:
a first obtaining module, configured to obtain a plurality of sample data, wherein each sample data comprises sub-sample data of one or more dimensions;
a decision tree forming module, configured to divide the sub-sample data of each dimension into multiple groups of bins and form a plurality of univariate binning decision trees according to the bins, including: forming each univariate binning decision tree by taking the sub-sample data as a root node, the bins as non-leaf nodes, and the target data as leaf nodes; wherein each of the sample data comprises target data;
a target binning obtaining module, configured to obtain the target binning corresponding to each dimension according to the plurality of univariate binning decision trees, including: calculating the sub-information value of each leaf node in each univariate binning decision tree; calculating the information value of each univariate binning decision tree according to the sub-information values, including: adding the sub-information values of the leaf nodes in each univariate binning decision tree to obtain the information value; and comparing the information values of the univariate binning decision trees, and taking the binning corresponding to the univariate binning decision tree with the minimum information value as the target binning; and
a model training module, configured to input the target bins into a prediction model so as to perform machine training on the prediction model.
6. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data processing method of any one of claims 1 to 4.
7. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data processing method of any of claims 1-4 via execution of the executable instructions.
CN201811117037.7A 2018-09-25 2018-09-25 Data processing method and device, computer readable storage medium and electronic equipment Active CN109408583B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811117037.7A CN109408583B (en) 2018-09-25 2018-09-25 Data processing method and device, computer readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811117037.7A CN109408583B (en) 2018-09-25 2018-09-25 Data processing method and device, computer readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN109408583A CN109408583A (en) 2019-03-01
CN109408583B true CN109408583B (en) 2023-04-07

Family

ID=65465141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811117037.7A Active CN109408583B (en) 2018-09-25 2018-09-25 Data processing method and device, computer readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN109408583B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245688A (en) * 2019-05-21 2019-09-17 中国平安财产保险股份有限公司 A kind of method and relevant apparatus of data processing
CN110532266A (en) * 2019-08-28 2019-12-03 京东数字科技控股有限公司 A kind of method and apparatus of data processing
CN110708285B (en) * 2019-08-30 2022-04-29 中国平安人寿保险股份有限公司 Flow monitoring method, device, medium and electronic equipment
CN110798227B (en) * 2019-09-19 2023-07-25 平安科技(深圳)有限公司 Model prediction optimization method, device, equipment and readable storage medium
CN113495906B (en) * 2020-03-20 2023-09-26 北京京东振世信息技术有限公司 Data processing method and device, computer readable storage medium and electronic equipment
CN112667741B (en) * 2020-04-13 2022-07-08 华控清交信息科技(北京)有限公司 Data processing method and device and data processing device
CN111507479B (en) * 2020-04-15 2021-08-10 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111782900B (en) * 2020-08-06 2024-03-19 平安银行股份有限公司 Abnormal service detection method and device, electronic equipment and storage medium
CN113837865A (en) * 2021-09-29 2021-12-24 重庆富民银行股份有限公司 Method for extracting multi-dimensional risk feature strategy

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7792770B1 (en) * 2007-08-24 2010-09-07 Louisiana Tech Research Foundation; A Division Of Louisiana Tech University Foundation, Inc. Method to indentify anomalous data using cascaded K-Means clustering and an ID3 decision tree
CN103942604A (en) * 2013-01-18 2014-07-23 上海安迪泰信息技术有限公司 Prediction method and system based on forest discrimination model
CN106250986A (en) * 2015-06-04 2016-12-21 波音公司 Advanced analysis base frame for machine learning
CN106707060A (en) * 2016-12-16 2017-05-24 中国电力科学研究院 Method for acquiring discrete state parameters of power transformer
CN107633265A (en) * 2017-09-04 2018-01-26 深圳市华傲数据技术有限公司 For optimizing the data processing method and device of credit evaluation model
CN108021984A (en) * 2016-11-01 2018-05-11 第四范式(北京)技术有限公司 Determine the method and system of the feature importance of machine learning sample
CN108182634A (en) * 2018-01-31 2018-06-19 国信优易数据有限公司 A kind of training method for borrowing or lending money prediction model, debt-credit Forecasting Methodology and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8280915B2 (en) * 2006-02-01 2012-10-02 Oracle International Corporation Binning predictors using per-predictor trees and MDL pruning
US7571159B2 (en) * 2006-02-01 2009-08-04 Oracle International Corporation System and method for building decision tree classifiers using bitmap techniques
US9373087B2 (en) * 2012-10-25 2016-06-21 Microsoft Technology Licensing, Llc Decision tree training in machine learning
US10963810B2 (en) * 2014-06-30 2021-03-30 Amazon Technologies, Inc. Efficient duplicate detection for machine learning data sets


Also Published As

Publication number Publication date
CN109408583A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109408583B (en) Data processing method and device, computer readable storage medium and electronic equipment
US10747637B2 (en) Detecting anomalous sensors
US9317578B2 (en) Decision tree insight discovery
CA3059937A1 (en) User credit evaluation method and device, electronic device, storage medium
CN109299177A (en) Data pick-up method, apparatus, storage medium and electronic equipment
CN113837596B (en) Fault determination method and device, electronic equipment and storage medium
CN110399268B (en) Abnormal data detection method, device and equipment
CN110708285B (en) Flow monitoring method, device, medium and electronic equipment
CN110727740B (en) Correlation analysis method and device, computer equipment and readable medium
CN111242387A (en) Talent departure prediction method and device, electronic equipment and storage medium
CN109828965B (en) Data processing method and electronic equipment
CN110348581B (en) User feature optimizing method, device, medium and electronic equipment in user feature group
CN110263083B (en) Knowledge graph processing method, device, equipment and medium
US20180039718A1 (en) Prediction of inhalable particles concentration
CN111724089A (en) Order receiving and dispatching distribution method, system, terminal and storage medium
CN113742450B (en) Method, device, electronic equipment and storage medium for user data grade falling label
CN114897183A (en) Problem data processing method, and deep learning model training method and device
CN114500075A (en) User abnormal behavior detection method and device, electronic equipment and storage medium
CN112784113A (en) Data processing method and device, computer readable storage medium and electronic equipment
US20230039971A1 (en) Automated return evaluation with anomoly detection
CN114529108B (en) Tree model based prediction method, apparatus, device, medium, and program product
US20230045574A1 (en) Automated calculation predictions with explanations
CN114596066A (en) Data anomaly detection method and device, medium and electronic equipment
CN113886541A (en) Demand evaluation information generation method, demand evaluation information display method and device
CN112766761A (en) Enterprise research and development investment potential evaluation method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant