CN110909085A - Data processing method, device, equipment and storage medium - Google Patents

Data processing method, device, equipment and storage medium

Info

Publication number
CN110909085A
Authority
CN
China
Prior art keywords
adjusted
binning
characteristic
data
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911177388.1A
Other languages
Chinese (zh)
Inventor
陈瑞钦
黄启军
李诗琦
唐兴兴
林冰垠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN201911177388.1A
Publication of CN110909085A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to the field of financial technology and discloses a data processing method, apparatus, device and storage medium. The data processing method comprises: acquiring the binning split points of each feature bin, and grouping the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks; if a binning adjustment instruction is detected, determining, from the feature bins, the bin to be adjusted and the feature data block to be adjusted of that bin according to the binning adjustment instruction and the correspondence; and adjusting the bin to be adjusted and the feature data block to be adjusted, and outputting an adjustment result. The invention addresses the technical problem that conventional binning adjustment methods respond slowly when faced with massive data, which lowers data processing efficiency.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present invention relates to the field of financial technology, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
With the development of computer technology, more and more technologies (big data, distributed computing, blockchain, artificial intelligence, etc.) are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). The security and real-time requirements of the financial industry, however, also place higher demands on these technologies.
Feature binning is a data preprocessing technique that reduces the effect of minor observation errors by grouping many continuous values into a smaller number of "bins". In practice, a user may adjust the binning result based on business experience. Such an adjustment changes the binning split points, which changes the statistical information within the bins, so the statistics must be computed again. With massive data, recomputing the statistics over the whole data set is heavy and time-consuming: the feature binning responds slowly, running performance drops, and the data processing efficiency of the system falls.
Disclosure of Invention
The main object of the invention is to provide a data processing method, apparatus, device and storage medium, aiming to solve the technical problem that conventional binning adjustment methods respond slowly when faced with massive data, resulting in low data processing efficiency.
In order to achieve the above object, an embodiment of the present invention provides a data processing method, where the data processing method includes:
acquiring the binning split points of each feature bin, and grouping the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks;
if a binning adjustment instruction is detected, determining, from the feature bins, the bin to be adjusted and the feature data block to be adjusted of that bin according to the binning adjustment instruction and the correspondence;
and adjusting the bin to be adjusted and the feature data block to be adjusted, and outputting an adjustment result.
Optionally, grouping the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks includes:
caching the feature data blocks of each feature bin, and grouping the cached feature data blocks according to the binning split points to generate the correspondence between each feature bin and its feature data blocks;
and adjusting the bin to be adjusted and the feature data block to be adjusted and outputting an adjustment result includes:
adjusting the bin to be adjusted and the feature data block to be adjusted in the cache, and outputting an adjustment result.
Optionally, adjusting the bin to be adjusted and the feature data block to be adjusted in the cache includes:
acquiring, in the cache, the split point to be adjusted of the feature data block to be adjusted, and acquiring the instruction type of the binning adjustment instruction;
and performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted and the binning split points.
Optionally, performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted and the binning split points includes:
if the instruction type is the bin-splitting type, splitting the bin to be adjusted and the feature data block to be adjusted according to the split point to be adjusted and the binning split points to obtain a plurality of target split bins and the target split data block corresponding to each target split bin;
and acquiring first statistical information of each target split data block, and generating a cache adjustment result from each target split bin, its corresponding target split data block, and the first statistical information of that data block.
Optionally, performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted and the binning split points includes:
if the instruction type is the bin-merging type, merging the bins to be adjusted and the feature data blocks to be adjusted according to the split point to be adjusted and the binning split points to obtain a target merged bin and the target merged data block corresponding to the target merged bin;
acquiring second statistical information of each bin to be adjusted, and summing the second statistical information to generate target statistical information;
and generating a cache adjustment result from the target merged data block and the target statistical information.
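As an illustration of the merging case, the following minimal Python sketch merges two adjacent bins and derives the merged statistics by adding the cached per-bin counts, without rescanning the merged data. The list-based layout, the `merge_bins` helper and the `event`/`non_event` counters are assumptions made for the example, not structures defined by the invention.

```python
def merge_bins(split_points, blocks, stats, k):
    """Merge adjacent bins k and k+1 (hypothetical helper).

    Assumed layout: split_points[i] is the upper boundary of bin i,
    blocks[i] is the cached data block of bin i, and stats[i] is a dict
    of per-bin counters.
    """
    if not 0 <= k < len(blocks) - 1:
        raise ValueError("bin k must have a right-hand neighbour")
    merged_block = blocks[k] + blocks[k + 1]
    # Add the cached statistics instead of recounting the merged data.
    merged_stats = {name: stats[k][name] + stats[k + 1][name] for name in stats[k]}
    # Drop the boundary that separated bin k from bin k+1.
    new_points = split_points[:k] + split_points[k + 1:]
    new_blocks = blocks[:k] + [merged_block] + blocks[k + 2:]
    new_stats = stats[:k] + [merged_stats] + stats[k + 2:]
    return new_points, new_blocks, new_stats


split_points = [10, 20, 30, 40]
blocks = [[3, 7], [12, 18], [25], [31, 39]]
stats = [
    {"event": 1, "non_event": 1},
    {"event": 0, "non_event": 2},
    {"event": 1, "non_event": 0},
    {"event": 1, "non_event": 1},
]
# Merge the 10-20 bin with the 20-30 bin.
new_points, new_blocks, new_stats = merge_bins(split_points, blocks, stats, 1)
```

Only the two merged bins are touched; the blocks and counters of all other bins are carried over unchanged.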
Optionally, after adjusting the bin to be adjusted and the feature data block to be adjusted and outputting an adjustment result, the method further includes:
computing the information value of each feature bin in the adjustment result;
if the information value is greater than or equal to a preset value, determining that the adjustment is qualified;
and if the information value is less than the preset value, determining that the adjustment is unqualified.
Optionally, computing the information value of each feature bin in the adjustment result includes:
counting the event value and the non-event value of each feature bin in the adjustment result to obtain a WOE (weight of evidence) value;
and obtaining the information value from the event value, the non-event value and the WOE value.
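The text does not give formulas for the WOE value or the information value. The sketch below uses the commonly used definitions, in which the WOE of a bin is the logarithm of the ratio between the bin's share of events and its share of non-events, and the information value sums (event share − non-event share) × WOE over the bins. The sign convention, the function names, the sample counts and the preset value of 0.1 are assumptions for illustration; the sketch also assumes every bin has at least one event and one non-event, and compares the summed value with the preset value.

```python
import math


def woe_and_iv(counts):
    """counts: list of (event_count, non_event_count), one pair per bin."""
    total_event = sum(event for event, _ in counts)
    total_non_event = sum(non_event for _, non_event in counts)
    woes = []
    iv = 0.0
    for event, non_event in counts:
        p_event = event / total_event              # bin's share of all events
        p_non_event = non_event / total_non_event  # bin's share of all non-events
        woe = math.log(p_event / p_non_event)
        woes.append(woe)
        iv += (p_event - p_non_event) * woe
    return woes, iv


def adjustment_qualified(iv, preset_value):
    """Qualified when the information value reaches the preset value."""
    return iv >= preset_value


counts = [(10, 90), (30, 70)]  # hypothetical per-bin counts
woes, iv = woe_and_iv(counts)
qualified = adjustment_qualified(iv, preset_value=0.1)
```

Because each term only needs per-bin counts, the information value can be refreshed after a split or merge from the cached statistics of the affected bins.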
The present invention also provides a data processing apparatus, comprising:
a relation module, configured to acquire the binning split points of each feature bin, and group the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks;
a determining module, configured to, if a binning adjustment instruction is detected, determine, from the feature bins, the bin to be adjusted and the feature data block to be adjusted of that bin according to the binning adjustment instruction and the correspondence;
and an adjusting module, configured to adjust the bin to be adjusted and the feature data block to be adjusted, and output an adjustment result.
Optionally, the relation module includes:
a cache processing unit, configured to cache the feature data blocks of each feature bin and group them according to the binning split points to generate the correspondence between each feature bin and its feature data blocks;
and the adjusting module includes:
a cache adjusting unit, configured to adjust the bin to be adjusted and the feature data block to be adjusted in the cache, and output an adjustment result.
Optionally, the cache adjusting unit includes:
an instruction type subunit, configured to acquire, in the cache, the split point to be adjusted of the feature data block to be adjusted, and acquire the instruction type of the binning adjustment instruction;
and a cache adjusting subunit, configured to perform cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted and the binning split points.
Optionally, the cache adjusting subunit is configured to:
if the instruction type is the bin-splitting type, split the bin to be adjusted and the feature data block to be adjusted according to the split point to be adjusted and the binning split points to obtain a plurality of target split bins and the target split data block corresponding to each target split bin;
and acquire first statistical information of each target split data block, and generate a cache adjustment result from each target split bin, its corresponding target split data block, and the first statistical information of that data block.
Optionally, the cache adjusting subunit is configured to:
if the instruction type is the bin-merging type, merge the bins to be adjusted and the feature data blocks to be adjusted according to the split point to be adjusted and the binning split points to obtain a target merged bin and the target merged data block corresponding to the target merged bin;
acquire second statistical information of each bin to be adjusted, and sum the second statistical information to generate target statistical information;
and generate a cache adjustment result from the target merged data block and the target statistical information.
Optionally, the data processing apparatus further includes:
a statistics module, configured to compute the information value of each feature bin in the adjustment result;
a qualification module, configured to determine that the adjustment is qualified if the information value is greater than or equal to the preset value;
and a disqualification module, configured to determine that the adjustment is unqualified if the information value is less than the preset value.
Optionally, the statistics module includes:
a statistics unit, configured to count the event value and the non-event value of each feature bin in the adjustment result to obtain a WOE (weight of evidence) value;
and an information value unit, configured to obtain the information value from the event value, the non-event value and the WOE value.
Further, to achieve the above object, the present invention also provides an apparatus comprising: a memory, a processor, and a data processing program stored on the memory and executable on the processor, wherein:
the data processing program, when executed by the processor, implements the steps of the data processing method as described above.
In addition, to achieve the above object, the present invention also provides a computer storage medium;
the computer storage medium has stored thereon a data processing program which, when executed by a processor, implements the steps of the data processing method as described above.
In the invention, the binning split points of each feature bin are acquired, and the feature data blocks of each feature bin are grouped according to the binning split points to generate a correspondence between each feature bin and its feature data blocks; if a binning adjustment instruction is detected, the bin to be adjusted and the feature data block to be adjusted of that bin are determined from the feature bins according to the binning adjustment instruction and the correspondence; and the bin to be adjusted and the feature data block to be adjusted are adjusted, and an adjustment result is output. The method can be applied to interactive feature binning of massive data in a big data environment. Only the feature data to be adjusted is adjusted, and data blocks that do not need adjustment are left untouched. This avoids recomputing statistics for a large number of unrelated data blocks, which shortens the statistics time, simplifies the statistics process and speeds up the response of feature binning. Running performance and response speed improve while the results remain accurate, which improves the interactive binning experience and the data processing efficiency.
Drawings
FIG. 1 is a schematic diagram of an apparatus architecture of a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a data processing method according to an embodiment of the present invention;
FIG. 3 is a block diagram of the binned data in the data processing method of the present invention;
FIG. 4 is a block diagram illustrating binning of data blocks in the data processing method of the present invention;
FIG. 5 is a schematic diagram illustrating splitting of binned data blocks in the data processing method of the present invention.
The objects, features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
FIG. 1 is a schematic diagram of the device structure of a hardware operating environment according to an embodiment of the present invention.
The device of the embodiment of the invention can be a PC or a server device.
As shown in FIG. 1, the device may include: a processor 1001 (such as a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 enables communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard; optionally, the user interface 1003 may also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the structure shown in FIG. 1 does not limit the device, which may include more or fewer components than shown, combine some components, or arrange the components differently.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a data processing program.
In the device shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and performing data communication with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be configured to call a data processing program stored in the memory 1005 and perform operations in various embodiments of the data processing method described below.
The main idea of the embodiments of the invention is as follows: acquiring the binning split points of each feature bin, and grouping the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks; if a binning adjustment instruction is detected, determining, from the feature bins, the bin to be adjusted and the feature data block to be adjusted of that bin according to the binning adjustment instruction and the correspondence; and adjusting the bin to be adjusted and the feature data block to be adjusted, and outputting an adjustment result. The method can be applied to interactive feature binning of massive data in a big data environment. Only the feature data to be adjusted is adjusted, and data blocks that do not need adjustment are left untouched. This avoids recomputing statistics for a large number of unrelated data blocks, which shortens the statistics time, simplifies the statistics process and speeds up the response of feature binning. Running performance and response speed improve while the results remain accurate, which improves the interactive binning experience and the data processing efficiency.
The embodiments of the invention take into account that, in the prior art, a user may adjust the binning result based on business experience; the adjustment changes the binning split points and therefore the statistical information within the bins, so the statistics must be computed again. With massive data, recomputing these statistics is heavy, consumes excessive system resources and responds slowly, which greatly reduces the data processing efficiency of the system.
The invention provides a solution that can be applied to interactive feature binning of massive data in a big data environment. It adjusts only the feature data to be adjusted and performs no operation on data blocks that do not need adjustment. This avoids recomputing statistics for a large number of unrelated data blocks, shortens the statistics time and speeds up the response of feature binning, so that running performance and response speed improve while the results remain accurate, and the interactive binning experience and the data processing efficiency improve accordingly.
Based on the above hardware structure, the embodiment of the data processing method of the present invention is provided.
The invention belongs to the field of financial technology (Fintech) and provides a data processing method.
In an embodiment of the data processing method, referring to FIG. 2, the data processing method includes:
Step S10, acquiring the binning split points of each feature bin, and grouping the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks;
Step S20, if a binning adjustment instruction is detected, determining, from the feature bins, the bin to be adjusted and the feature data block to be adjusted of that bin according to the binning adjustment instruction and the correspondence;
Step S30, adjusting the bin to be adjusted and the feature data block to be adjusted, and outputting an adjustment result.
The data processing method can be applied to equipment, and comprises the following specific contents:
Step S10, acquiring the binning split points of each feature bin, and grouping the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks;
Each feature bin corresponds to its own binning split point, which marks the data boundary of that feature bin. For example, suppose there is a set of age feature bins: feature bin A (0-10 years old), feature bin B (10-20 years old), feature bin C (20-30 years old), and feature bin D (30-40 years old). Each feature bin stores the feature data blocks of the age feature that fall within its range. The device acquires all feature bins and groups the feature data blocks according to the binning split points. Referring to FIG. 3, the content on the left of FIG. 3 is bin 1. Data of the same group may be stored on one computing node or across several computing nodes, and several groups may also be stored on the same computing node, marked so that they can be told apart. Grouping establishes the correspondence between each bin and the data within its range, which prepares the data in advance for fast reading and matching. For example, all feature data blocks in feature bin A are mapped to the bin 1 data block, all feature data blocks in feature bin B are mapped to the bin 2 data block, and so on, yielding the bin 1 data block, the bin 2 data block, and so on up to the bin n data block. Feature bin n and the bin n data block thus map to each other: the bin n data block is the cached data block of feature bin n and contains all feature data blocks of feature bin n.
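The grouping step can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the `group_by_split_points` helper, the sample ages, and the convention that a value equal to a split point belongs to the higher bin are all assumptions.

```python
from bisect import bisect_right
from collections import defaultdict


def group_by_split_points(values, split_points):
    """Return {bin index: cached data block} for the given values.

    split_points must be sorted. Bin i is assumed to hold values v with
    split_points[i-1] <= v < split_points[i].
    """
    blocks = defaultdict(list)
    for value in values:
        # bisect_right gives the index of the first split point above value.
        blocks[bisect_right(split_points, value)].append(value)
    return dict(blocks)


ages = [3, 12, 18, 25, 31, 39, 7]  # hypothetical feature data
split_points = [10, 20, 30, 40]    # bins A, B, C, D from the example
blocks = group_by_split_points(ages, split_points)
```

Here index 0 plays the role of the bin 1 data block for feature bin A, index 1 the bin 2 data block for feature bin B, and so on. Once the mapping is cached, a later adjustment only needs to look up the affected blocks.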
Step S20, if a binning adjustment instruction is detected, determining, from the feature bins, the bin to be adjusted and the feature data block to be adjusted of that bin according to the binning adjustment instruction and the correspondence;
Detecting a binning adjustment instruction indicates that there is currently a business requirement to adjust the feature bins. The requirement identifies the bin to be adjusted, and that bin and its corresponding feature data block to be adjusted can be located among the feature bins through the correspondence. For example, if the binning adjustment instruction requires splitting feature bin B (10-20 years old), feature bin B is taken as the bin to be adjusted, together with its corresponding feature data block to be adjusted.
In this embodiment, the bin to be adjusted may be determined from the binning adjustment instruction. For example, the feature bins include feature bin A (0-10 years old), feature bin B (10-20 years old), feature bin C (20-30 years old) and feature bin D (30-40 years old). If the binning adjustment instruction targets the feature data at age 25, then the split point 25 and the correspondence between split points and feature bins locate feature bin C (20-30 years old) as the bin to be adjusted, together with its corresponding feature data block to be adjusted.
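Locating the bin for a requested point is a search over the sorted boundaries. The sketch below assumes the boundaries are kept as a sorted list of upper bounds and uses a binary search; the `locate_bin` helper and the bin names are hypothetical.

```python
from bisect import bisect_right


def locate_bin(split_points, point):
    """Return the index of the bin whose range contains the point."""
    return bisect_right(split_points, point)


split_points = [10, 20, 30, 40]   # upper bounds of bins A, B, C, D
bin_names = ["A", "B", "C", "D"]
index = locate_bin(split_points, 25)
bin_to_adjust = bin_names[index]  # feature bin C (20-30 years old)
```

The lookup touches only the boundary list, so its cost does not depend on how much data the bins hold.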
Step S30, adjusting the bin to be adjusted and the feature data block to be adjusted, and outputting an adjustment result.
The adjustment may be bin merging or bin splitting, depending on the actual situation. Specifically, grouping the feature data blocks of each feature bin according to the binning split points to generate a correspondence between each feature bin and its feature data blocks includes:
caching the feature data blocks of each feature bin, and grouping the cached feature data blocks according to the binning split points to generate the correspondence between each feature bin and its feature data blocks;
and adjusting the bin to be adjusted and the feature data block to be adjusted and outputting an adjustment result includes:
adjusting the bin to be adjusted and the feature data block to be adjusted in the cache, and outputting an adjustment result.
This embodiment adopts a cache grouping mechanism, so that the binned data is adjusted in the cache. This avoids excessive consumption of system resources when computing statistics over massive data and improves response speed and data processing efficiency. Specifically, when the feature bins need to be adjusted to meet a business requirement, the device acquires all feature bins and maps them into the cache.
Further, adjusting the bin to be adjusted and the feature data block to be adjusted in the cache includes:
Step A1, acquiring, in the cache, the split point to be adjusted of the feature data block to be adjusted, and acquiring the instruction type of the binning adjustment instruction;
The split point (quantile point) to be adjusted of the feature data to be adjusted is acquired, where the feature data to be adjusted is represented as data blocks. For example, suppose there are currently four feature data blocks: data block A (0-10 years old), data block B (10-20 years old), data block C (20-30 years old), and data block D (30-40 years old). The split-point list derived from these data blocks is [10, 20, 30, 40], so the split point to be adjusted can be determined from the split-point list in the cache. For example, if the feature data to be adjusted is data block D (30-40 years old), the corresponding split point to be adjusted is 40.
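The example above can be reproduced with a small Python sketch. It assumes each cached data block is described by a (lower, upper) age range and that a block's quantile point (split point) is its upper bound; the dictionary layout and helper names are hypothetical.

```python
# Hypothetical cached data blocks, keyed by name, with (lower, upper) age ranges.
block_ranges = {"A": (0, 10), "B": (10, 20), "C": (20, 30), "D": (30, 40)}


def split_point_list(ranges):
    """Collect the upper bound of every block into a sorted list."""
    return sorted(upper for _, upper in ranges.values())


def split_point_to_adjust(ranges, block_name):
    """The split point to be adjusted is the upper bound of the chosen block."""
    return ranges[block_name][1]


points = split_point_list(block_ranges)
target = split_point_to_adjust(block_ranges, "D")
```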
It can be understood that there are two types of binning adjustment instruction, the bin-splitting type and the bin-merging type. To avoid errors during adjustment, the type of the binning adjustment instruction must be identified to obtain the instruction type.
Step A2, performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted and the binning split points.
Different instruction types correspond to different adjustment flows, and the split point to be adjusted and the target split point locate the object to be adjusted. The feature data to be adjusted can therefore be adjusted according to the instruction type, the split point to be adjusted and the target split point to obtain the target feature data block.
Further, performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted and the binning split points includes:
Step A21, if the instruction type is the bin-splitting type, splitting the bin to be adjusted and the feature data block to be adjusted according to the split point to be adjusted and the binning split points to obtain a plurality of target split bins and the target split data block corresponding to each target split bin;
If the instruction type is the bin-splitting type, the current binning adjustment instruction splits one specific bin among the feature bins, and the feature data block to be adjusted is the object to be split.
Bin splitting splits the bin to be adjusted into two bins and the feature data block to be adjusted into two data blocks, that is, two bins and their corresponding feature data blocks are newly generated. The number of every bin whose original number is greater than k+1 is then increased by one, as is the number of its corresponding feature data block. Referring specifically to FIG. 4, suppose the bin k data block is the feature data block to be adjusted, k is the split point to be adjusted, and k+1 is the target split point. The bin k data block (the feature data block to be adjusted) can then be split, according to k and k+1, into the bin k data block and the bin k+1 data block (the target split data blocks). Because the bin k data block maps to feature bin k, this means that feature bin k is split into feature bin k and feature bin k+1. In other words, splitting adjusts only the current feature data block to be adjusted, and no operation is required on any other data.
Step A22: obtaining first statistical information of each target split data block, and generating a cache adjustment result according to each target split bin, the target split data block corresponding to each target split bin, and the first statistical information corresponding to each target split data block.
Each target split data block stores its own statistical information. For example, the feature data in the original feature data block to be adjusted is distributed among the target split data blocks, and the statistical information of the original block (such as event counts and non-event counts) is redistributed accordingly, which yields the statistical information of each target split data block. The cache adjustment result is then generated from the target split data blocks and their statistical information.
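Steps A21 and A22 can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the `Block` class, the `split_bin` function, the 0-based bin index `k`, and the rule that a value equal to the new split point goes to the lower bin are all assumptions. The point it demonstrates is that only the records of the block being split are scanned, while the other blocks are reused unchanged.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    upper: float                                  # upper split point of the bin
    records: list = field(default_factory=list)   # (value, is_event) pairs

    def stats(self):
        """Event and non-event counts of this block."""
        events = sum(1 for _, is_event in self.records if is_event)
        return {"event": events, "non_event": len(self.records) - events}


def split_bin(blocks, k, new_point):
    """Split blocks[k] at new_point, scanning only that block's records."""
    old = blocks[k]
    lower = blocks[k - 1].upper if k > 0 else float("-inf")
    if not lower < new_point < old.upper:
        raise ValueError("new split point must lie strictly inside bin k")
    # Assumed boundary rule: values equal to new_point go to the lower bin.
    left = Block(new_point, [r for r in old.records if r[0] <= new_point])
    right = Block(old.upper, [r for r in old.records if r[0] > new_point])
    # Later blocks are reused untouched; their list position shifts by one.
    return blocks[:k] + [left, right] + blocks[k + 1:]
```

Because the two new blocks partition the records of the old block, their event and non-event counts always add up to the counts of the block that was split.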
It should be noted that the essence of a binning adjustment is to generate new split point information; for example, the original split points are [10,20,30,40] and the adjusted split points are [10,30,40]. As the split points change, the statistical information in each bin changes, so all data needs to be traversed to recalculate the statistical information of each bin.
Further, performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted, and the binning split point includes:
Step A23: if the instruction type is the bin-merging type, merging the bins to be adjusted and the feature data blocks to be adjusted according to the split points to be adjusted and the binning split point, to obtain a target merged bin and the target merged data block corresponding to the target merged bin;
Step A24: obtaining second statistical information of the bins to be adjusted, and summing the second statistical information to generate target statistical information;
Step A25: generating a cache adjustment result according to the target merged data block and the target statistical information.
If the instruction type is the bin-merging type, the current binning adjustment instruction merges specific bins among all the feature bins, so several feature bins are involved. In this case the feature data to be adjusted is the object to be merged, and there are several feature data blocks to be adjusted.
Merging bins means merging the corresponding data blocks, two or more, into one data block. The number of every bin whose original number is greater than k+1 is then decreased by one, and the number of the corresponding data block is decreased by one as well. Referring to FIG. 5, assume that the bin k data block and the bin k+1 data block are the feature data blocks to be adjusted, k and k+1 are the split points to be adjusted, and k is the binning split point. According to k and k+1 (the split points to be adjusted) and k (the binning split point), the bin k data block and the bin k+1 data block (the feature data blocks to be adjusted) can be merged into a bin k data block (the target merged data block). Because the bin k and bin k+1 data blocks merged in the cache map to feature bin k and feature bin k+1, this means that feature bin k and feature bin k+1 are merged into feature bin k. In other words, when the feature data blocks to be adjusted are merged in the cache, only those blocks are adjusted, and no operation is required on the other data blocks.
The statistical information of all feature data blocks to be adjusted is obtained and then summed to obtain the target statistical information. Because the feature data blocks to be adjusted are merged, their original statistical information must be merged as well. For example, given statistical information a of data block A and statistical information b of data block B, merging data blocks A and B also merges a and b, which produces the target statistical information.
After the data blocks are merged, the number of every bin whose original number is greater than k+1 is decreased by one, and the number of the corresponding data block is decreased by one. The merging process is defined by the formulas:
Figure BDA0002287225140000121
Figure BDA0002287225140000122
After the target merged data block and the target statistical information are obtained, they are combined to generate the target feature data block.
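Steps A23 to A25 can be sketched in Python as follows. The patent's own merging formulas are contained in figures that are not reproduced in this text, so the sketch relies only on the prose description that the statistics of the merged blocks are added together. The dict layout of a bin, the function name `merge_bins`, and the 0-based index `k` are assumptions for illustration.

```python
def merge_bins(bins, k):
    """Merge bins[k] and bins[k + 1] into one bin, touching no other bin.

    Each bin is a dict with the keys "upper", "event" and "non_event".
    """
    if not 0 <= k < len(bins) - 1:
        raise IndexError("bin k must have a right-hand neighbour")
    left, right = bins[k], bins[k + 1]
    merged = {
        "upper": right["upper"],                      # keep the outer split point
        "event": left["event"] + right["event"],      # add the event counts
        "non_event": left["non_event"] + right["non_event"],
    }
    # Bins after k + 1 move down one position, so their number drops by one.
    return bins[:k] + [merged] + bins[k + 2:]
```

Merging the second and third bins of the split point list [10,20,30,40] in this way yields [10,30,40], which is the adjusted list given as an example earlier in the description.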
In this embodiment, the binning split points of each feature bin are obtained, and the feature data blocks of the feature bins are grouped according to the binning split points to generate the correspondence between each feature bin and its feature data block; if a binning adjustment instruction is detected, the bin to be adjusted and its feature data block to be adjusted are determined from the feature bins according to the binning adjustment instruction and the correspondence; and the bin to be adjusted and the feature data block to be adjusted are adjusted and the adjustment result is output. The method can be applied to interactive feature binning of massive data in a big data environment. The feature data to be adjusted is adjusted directly, and no operation is performed on data blocks that do not need adjustment, so statistics over a large number of irrelevant data blocks are avoided. This shortens the time spent on statistics, simplifies the statistical process, and improves the response speed of feature binning while keeping the results accurate, which improves the interactive binning experience and the data processing efficiency.
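The initial grouping step of this embodiment, which builds the correspondence between feature bins and feature data blocks, might look like the following Python sketch. The function name `group_into_blocks`, the 0-based bin numbers, and the boundary rule (a value equal to a split point belongs to the bin below it) are assumptions; the description does not state how boundary values are assigned.

```python
from bisect import bisect_left


def group_into_blocks(values, split_points):
    """Group feature values into one data block per bin.

    split_points must be sorted in ascending order. The result maps a
    0-based bin number to the list of values that fall into that bin.
    """
    blocks = {i: [] for i in range(len(split_points))}
    for v in values:
        # Index of the first split point that is >= v.
        i = bisect_left(split_points, v)
        if i == len(split_points):
            raise ValueError(f"{v} lies above the last split point")
        blocks[i].append(v)
    return blocks
```

Once the blocks exist, a later split or merge only needs to touch the blocks named by the adjustment instruction, which is the source of the time saving claimed above.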
Further, based on the first embodiment, a second embodiment of the data processing method of the present invention is provided. In this embodiment, after the bin to be adjusted and the feature data block to be adjusted are adjusted and the adjustment result is output, the method further includes:
Step a: computing the information value of each feature bin in the adjustment result;
The information value is the IV value (IV stands for Information Value), which measures the predictive power of a variable; the greater the IV value, the better the binning effect it indicates. In this embodiment, the information value of each feature bin in the adjustment result is computed.
Specifically, computing the information value of each feature bin in the target feature bin list includes:
Step a1: counting the event value and the non-event value of each feature bin in the adjustment result to obtain the WOE (weight of evidence) value;
Step a2: obtaining the information value according to the event value, the non-event value, and the WOE value.
Specifically, the following algorithm can be referred to:
Feature: $X$;
Number of bins: $n$, the number of segments into which the sorted feature $X$ is divided;
Sample: $x_i$, $1 \le i < N$, one record of the sorted feature $X$;
Binning split points: $S$, comprising $n-1$ distinct values $S_i$, $1 \le i \le n-1$, with $S_i < S_{i+1}$;
Number of events in each bin: $T_{event}$, comprising $n$ values $T_{event}^{i}$, $1 \le i \le n$;
Number of non-events in each bin: $T_{non\text{-}event}$, comprising $n$ values $T_{non\text{-}event}^{i}$, $1 \le i \le n$;
Total number of events: $N_{event} = \sum_{i=1}^{n} T_{event}^{i}$;
Total number of non-events: $N_{non\text{-}event} = \sum_{i=1}^{n} T_{non\text{-}event}^{i}$;
WOE value of each bin:
Figure BDA0002287225140000135
IV value of each bin:
Figure BDA0002287225140000136
Finally, the binning effect evaluation index, the IV value:
Figure BDA0002287225140000137
From the event values and non-event values in the above algorithm, the WOE value of each target feature bin can be calculated, and the IV value of the target feature bins is obtained from the WOE values; the IV value is the information value.
Step b: if the information value is greater than or equal to a preset value, determining that the adjustment effect is qualified;
Step c: if the information value is less than the preset value, determining that the adjustment effect is unqualified.
In this embodiment, the reference standard for the information value is preset and can be set according to actual business requirements. If the information value is greater than or equal to the preset value, the binning effect of the current adjustment result is qualified; if it is less than the preset value, the binning effect is unqualified. For example, with preset value a and information value b, if b is greater than or equal to a, the current binning adjustment shows a clear trend effect and the system confirms that the binning effect of the adjustment result is qualified; if b is smaller than a, the trend effect is not clear and the system confirms that the binning effect is unqualified.
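Steps a to c can be sketched in Python as follows. The patent's own WOE and IV formulas are contained in figures that are not reproduced in this text, so the sketch uses the widely used definitions as an assumption: the WOE of a bin is the natural logarithm of the bin's share of all events divided by its share of all non-events, the IV of a bin is the difference of the two shares multiplied by the WOE, and the total IV is the sum over bins. The sign convention of the WOE is an assumption, but the IV is the same under either convention. The function names are hypothetical.

```python
import math


def woe_iv(events, non_events):
    """Return (WOE per bin, IV per bin, total IV) from per-bin counts.

    Assumes every bin contains at least one event and one non-event,
    because the logarithm is undefined otherwise.
    """
    n_event = sum(events)
    n_non_event = sum(non_events)
    woe, iv_bins = [], []
    for t_event, t_non_event in zip(events, non_events):
        p_event = t_event / n_event              # bin's share of all events
        p_non_event = t_non_event / n_non_event  # bin's share of all non-events
        w = math.log(p_event / p_non_event)
        woe.append(w)
        iv_bins.append((p_event - p_non_event) * w)
    return woe, iv_bins, sum(iv_bins)


def binning_is_qualified(iv, preset_value):
    """Steps b and c: qualified when the IV reaches the preset value."""
    return iv >= preset_value
```

A bin whose event share equals its non-event share contributes nothing to the IV, so a binning in which every bin looks like the overall population scores zero and fails any positive preset value.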
In addition, an embodiment of the present invention further provides a data processing apparatus, which includes:
a relation module, configured to obtain the binning split points of each feature bin, and group the feature data blocks of the feature bins according to the binning split points to generate a correspondence between each feature bin and its feature data block;
a determining module, configured to, if a binning adjustment instruction is detected, determine, from the feature bins, a bin to be adjusted and a feature data block to be adjusted of the bin to be adjusted according to the binning adjustment instruction and the correspondence;
and an adjusting module, configured to adjust the bin to be adjusted and the feature data block to be adjusted, and output an adjustment result.
Optionally, the relation module includes:
a cache processing unit, configured to cache the feature data blocks of the feature bins, and group the feature data blocks of the feature bins according to the binning split points to generate the correspondence between each feature bin and its feature data block;
the adjusting module includes:
a cache adjusting unit, configured to adjust the bin to be adjusted and the feature data block to be adjusted in the cache, and output the adjustment result.
Optionally, the cache adjusting unit includes:
an instruction type subunit, configured to obtain, in the cache, the split point to be adjusted of the feature data block to be adjusted, and obtain the instruction type of the binning adjustment instruction;
and a cache adjusting subunit, configured to perform cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted, and the binning split point.
Optionally, the cache adjusting subunit is configured to:
if the instruction type is the bin-splitting type, split the bin to be adjusted and the feature data block to be adjusted according to the split point to be adjusted and the binning split point, to obtain a plurality of target split bins and the target split data block corresponding to each target split bin;
and obtain first statistical information of each target split data block, and generate a cache adjustment result according to each target split bin, the target split data block corresponding to each target split bin, and the first statistical information corresponding to each target split data block.
Optionally, the cache adjusting subunit is configured to:
if the instruction type is the bin-merging type, merge the bins to be adjusted and the feature data blocks to be adjusted according to the split points to be adjusted and the binning split point, to obtain a target merged bin and the target merged data block corresponding to the target merged bin;
obtain second statistical information of the bins to be adjusted, and sum the second statistical information to generate target statistical information;
and generate a cache adjustment result according to the target merged data block and the target statistical information.
Optionally, the data processing apparatus further includes:
a statistics module, configured to compute the information value of each feature bin in the adjustment result;
a qualified module, configured to determine that the adjustment effect is qualified if the information value is greater than or equal to the preset value;
and an unqualified module, configured to determine that the adjustment effect is unqualified if the information value is less than the preset value.
Optionally, the statistics module includes:
a statistics unit, configured to count the event value and the non-event value of each feature bin in the adjustment result to obtain the WOE value;
and an information value unit, configured to obtain the information value according to the event value, the non-event value, and the WOE value.
In addition, an embodiment of the present invention further provides an apparatus, where the apparatus includes: a memory 109, a processor 110 and a data processing program stored on the memory 109 and executable on the processor 110, the data processing program implementing the steps of the embodiments of the data processing method described above when executed by the processor 110.
Furthermore, the present invention also provides a computer storage medium storing one or more programs, which can be further executed by one or more processors for implementing the steps of the embodiments of the data processing method described above.
The specific implementation of the device and the storage medium (i.e., the computer storage medium) of the present invention is basically the same as the embodiments of the data processing method described above, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be substantially or partially embodied in the form of a software product, which is stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for causing a device (e.g. mobile phone, computer, server, or network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A data processing method, characterized in that the data processing method comprises:
obtaining the binning split points of each feature bin, and grouping the feature data blocks of the feature bins according to the binning split points to generate a correspondence between each feature bin and its feature data block;
if a binning adjustment instruction is detected, determining, from the feature bins, a bin to be adjusted and a feature data block to be adjusted of the bin to be adjusted according to the binning adjustment instruction and the correspondence;
and adjusting the bin to be adjusted and the feature data block to be adjusted, and outputting an adjustment result.
2. The data processing method according to claim 1, wherein grouping the feature data blocks of the feature bins according to the binning split points to generate the correspondence between each feature bin and its feature data block comprises:
caching the feature data blocks of the feature bins, and grouping the feature data blocks of the feature bins according to the binning split points to generate the correspondence between each feature bin and its feature data block;
and adjusting the bin to be adjusted and the feature data block to be adjusted and outputting the adjustment result comprises:
adjusting the bin to be adjusted and the feature data block to be adjusted in the cache, and outputting the adjustment result.
3. The data processing method according to claim 2, wherein adjusting the bin to be adjusted and the feature data block to be adjusted in the cache comprises:
obtaining, in the cache, the split point to be adjusted of the feature data block to be adjusted, and obtaining the instruction type of the binning adjustment instruction;
and performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted, and the binning split point.
4. The data processing method according to claim 3, wherein performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted, and the binning split point comprises:
if the instruction type is the bin-splitting type, splitting the bin to be adjusted and the feature data block to be adjusted according to the split point to be adjusted and the binning split point, to obtain a plurality of target split bins and the target split data block corresponding to each target split bin;
and obtaining first statistical information of each target split data block, and generating a cache adjustment result according to each target split bin, the target split data block corresponding to each target split bin, and the first statistical information corresponding to each target split data block.
5. The data processing method according to claim 3, wherein performing cache adjustment processing on the bin to be adjusted and the feature data block to be adjusted according to the instruction type, the split point to be adjusted, and the binning split point comprises:
if the instruction type is the bin-merging type, merging the bins to be adjusted and the feature data blocks to be adjusted according to the split points to be adjusted and the binning split point, to obtain a target merged bin and the target merged data block corresponding to the target merged bin;
obtaining second statistical information of the bins to be adjusted, and summing the second statistical information to generate target statistical information;
and generating a cache adjustment result according to the target merged data block and the target statistical information.
6. The data processing method according to claim 1, wherein after adjusting the bin to be adjusted and the feature data block to be adjusted and outputting the adjustment result, the method further comprises:
computing the information value of each feature bin in the adjustment result;
if the information value is greater than or equal to a preset value, determining that the adjustment effect is qualified;
and if the information value is less than the preset value, determining that the adjustment effect is unqualified.
7. The data processing method according to claim 6, wherein computing the information value of each feature bin in the adjustment result comprises:
counting the event value and the non-event value of each feature bin in the adjustment result to obtain the WOE value;
and obtaining the information value according to the event value, the non-event value, and the WOE value.
8. A data processing apparatus, characterized in that the data processing apparatus comprises:
a relation module, configured to obtain the binning split points of each feature bin, and group the feature data blocks of the feature bins according to the binning split points to generate a correspondence between each feature bin and its feature data block;
a determining module, configured to, if a binning adjustment instruction is detected, determine, from the feature bins, a bin to be adjusted and a feature data block to be adjusted of the bin to be adjusted according to the binning adjustment instruction and the correspondence;
and an adjusting module, configured to adjust the bin to be adjusted and the feature data block to be adjusted, and output an adjustment result.
9. An apparatus, characterized in that the apparatus comprises: memory, processor and data processing program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the data processing method according to any one of claims 1 to 7.
10. A storage medium, characterized in that the storage medium has stored thereon a data processing program which, when executed by a processor, implements the steps of the data processing method according to any one of claims 1 to 7.
CN201911177388.1A 2019-11-25 2019-11-25 Data processing method, device, equipment and storage medium Pending CN110909085A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911177388.1A CN110909085A (en) 2019-11-25 2019-11-25 Data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911177388.1A CN110909085A (en) 2019-11-25 2019-11-25 Data processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110909085A true CN110909085A (en) 2020-03-24

Family

ID=69819759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911177388.1A Pending CN110909085A (en) 2019-11-25 2019-11-25 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110909085A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130325825A1 (en) * 2012-05-29 2013-12-05 Scott Pope Systems And Methods For Quantile Estimation In A Distributed Data System
US20160133145A1 (en) * 2014-11-10 2016-05-12 Xerox Corporation Method and apparatus for defining performance milestone track for planned process
CN109815267A (en) * 2018-12-21 2019-05-28 天翼征信有限公司 The branch mailbox optimization method and system, storage medium and terminal of feature in data modeling
CN110084376A (en) * 2019-04-30 2019-08-02 成都四方伟业软件股份有限公司 To the method and device of the automatic branch mailbox of data
CN110245140A (en) * 2019-06-12 2019-09-17 同盾控股有限公司 Data branch mailbox processing method and processing device, electronic equipment and computer-readable medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
COX, NICHOLAS J.: "Speaking Stata: Matrices as look-up tables", 《STATA JOURNAL》, 5 February 2013 (2013-02-05), pages 748 - 758 *
夏晨琦: "局部最优分箱及其在评分卡模型中的应用", 《统计与决策》, 31 July 2019 (2019-07-31), pages 63 - 67 *
林一帆: "基于机器学习的信用评分模型研究", 《中国优秀硕士学位论文全文数据库(信息科技辑)》, 15 September 2019 (2019-09-15), pages 140 - 133 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506485A (en) * 2020-04-15 2020-08-07 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111507479A (en) * 2020-04-15 2020-08-07 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111506485B (en) * 2020-04-15 2021-07-27 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium
CN111507479B (en) * 2020-04-15 2021-08-10 深圳前海微众银行股份有限公司 Feature binning method, device, equipment and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination