CN112418258A - Feature discretization method and device - Google Patents


Info

Publication number
CN112418258A
Authority
CN
China
Prior art keywords
discrete intervals
intervals
adjacent
interval
distance
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN201910779357.7A
Other languages
Chinese (zh)
Inventor
刘洋
Current Assignee (the listed assignee may be inaccurate)
Beijing Jingdong Zhenshi Information Technology Co Ltd
Original Assignee
Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date (the priority date is an assumption and is not a legal conclusion)
Filing date
Publication date
Application filed by Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority claimed from application CN201910779357.7A
Published as CN112418258A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G06F18/23: Clustering techniques

Abstract

The invention discloses a feature discretization method and apparatus, relating to the technical field of computers. One embodiment of the method comprises: sorting the samples according to the values of a feature; determining each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into several discrete intervals; and cluster-merging the discrete intervals to obtain a discretization result. This embodiment takes into account both the distribution of the independent-variable values themselves and the relationship between the independent-variable values and the sample labels, forming a semi-supervised method for discretizing continuous features.

Description

Feature discretization method and device
Technical Field
The invention relates to the technical field of computers, and in particular to a feature discretization method and apparatus.
Background
Feature discretization seeks to derive several intervals from the continuous values of an independent variable, such that values within the same interval are in similar states. Current feature discretization methods fall into three types: 1. Discretize the continuous feature values by quantiles: the values of the independent variable are divided evenly so that each interval contains the same number of samples. 2. Discretize the feature values by clustering: an unsupervised clustering method, such as K-means, divides the values of the independent variable into several intervals. 3. Discretize the feature according to the splits of a decision tree: following the definition of information gain in decision trees, the point that yields the maximum information gain is found, and the values of the independent variable are divided into several intervals accordingly.
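As a point of comparison, the first (quantile) method can be sketched in a few lines of Python. This is an illustrative stdlib-only sketch, not code from the patent; the helper names and the convention that a value equal to a cut point falls in the right-hand interval are assumptions.

```python
def quantile_bins(values, n_bins):
    """Cut points that split the sorted values into n_bins intervals
    holding (near-)equal numbers of samples."""
    ordered = sorted(values)
    n = len(ordered)
    # Place a boundary after every n/n_bins-th sample.
    return [ordered[round(i * n / n_bins)] for i in range(1, n_bins)]

def assign_bin(x, edges):
    """Index of the interval that x falls into, given the cut points.
    A value equal to a cut point goes to the right-hand interval."""
    return sum(1 for e in edges if x >= e)
```

As the patent notes, this considers only the distribution of the values, never the sample labels.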
In the process of implementing the invention, the inventor found that the prior art has at least the following problems:
The first and second methods are statistical and unsupervised learning methods: they consider only the distribution of the independent-variable values, not the relationship between those values and the sample labels, even though the sample labels corresponding to a value interval largely characterize that interval. Moreover, both methods require manually specifying how many intervals the independent variable's values should be discretized into. The third method does consider the relationship between the independent-variable values and the sample labels, but its search for split points is based on multiple independent variables, so it cannot fully mine the splittable points within a single independent variable's values; it also requires a manually specified stopping condition for the split-point search.
Disclosure of Invention
In view of this, embodiments of the present invention provide a feature discretization method and apparatus, which take into account both the distribution of the independent-variable values themselves and the relationship between the independent-variable values and the sample labels, forming a semi-supervised method for discretizing continuous features.
To achieve the above object, according to one aspect of the embodiments of the present invention, there is provided a feature discretization method, comprising:
sorting the samples according to the values of the feature;
determining each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into several discrete intervals;
and cluster-merging the several discrete intervals to obtain a discretization result.
Optionally, determining each demarcation point of the feature according to the information gain at each value point of the feature comprises:
searching, according to the sample label at each value point of the feature, for pairs of adjacent value points whose sample labels differ; and taking the midpoint of the values of each such adjacent pair as a demarcation point.
Optionally, cluster-merging the several discrete intervals comprises:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals;
and comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent discrete intervals with the minimum distance into one interval, and repeating these steps until the number of updated discrete intervals equals the preset number of intervals.
Optionally, the preset number of intervals is determined according to the following steps:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals; comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent discrete intervals with the minimum distance into one interval, and repeating these steps until all the discrete intervals are fused into a single interval, to obtain fusion data; the fusion data comprises the distance between the two adjacent discrete intervals fused at each step and the number of discrete intervals after each update;
and fitting a curve of the distance with respect to the number of discrete intervals from the fusion data, and then determining the preset number of intervals using the elbow method.
According to a second aspect of the embodiments of the present invention, there is provided a feature discretization apparatus including:
a preprocessing module, which sorts the samples according to the values of the feature;
a discretization module, configured to determine each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into several discrete intervals;
and a fusion module, configured to cluster-merge the several discrete intervals to obtain the discretization result.
Optionally, the determining, by the discretization module, of each demarcation point of the feature according to the information gain at each value point of the feature comprises:
searching, according to the sample label at each value point of the feature, for pairs of adjacent value points whose sample labels differ; and taking the midpoint of the values of each such adjacent pair as a demarcation point.
Optionally, the cluster-merging of the several discrete intervals by the fusion module comprises:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals;
and comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent discrete intervals with the minimum distance into one interval, and repeating these steps until the number of updated discrete intervals equals the preset number of intervals.
Optionally, the fusion module is further configured to determine the preset number of intervals according to the following steps:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals; comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent discrete intervals with the minimum distance into one interval, and repeating these steps until all the discrete intervals are fused into a single interval, to obtain fusion data; the fusion data comprises the distance between the two adjacent discrete intervals fused at each step and the number of discrete intervals after each update;
and fitting a curve of the distance with respect to the number of discrete intervals from the fusion data, and then determining the preset number of intervals using the elbow method.
According to a third aspect of embodiments of the present invention, there is provided a feature discretization electronic device including:
one or more processors;
a storage device for storing one or more programs,
which, when executed by the one or more processors, cause the one or more processors to implement the feature discretization method provided in the first aspect of the embodiments of the present invention.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method for discretizing features provided by the first aspect of the embodiments of the present invention.
One embodiment of the above invention has the following advantages or beneficial effects: discretizing the feature into several discrete intervals according to the information gain at each value point of the feature performs an initial discretization of the independent-variable values using the supervised information carried by the sample labels; cluster merging then further optimizes the result of the initial discretization by combining unsupervised information beyond the sample labels, namely the distribution of the feature values, thereby forming a semi-supervised method for discretizing continuous features.
Further effects of the above optional embodiments will be described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of a main flow of a feature discretization method according to an embodiment of the invention;
FIG. 2 is a schematic diagram of fitting a curve of distance versus the number of discrete intervals from the fusion data;
FIG. 3 is a schematic diagram of the main blocks of a feature discretization arrangement in accordance with an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 5 is a schematic block diagram of a computer system suitable for implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
According to a first aspect of embodiments of the present invention, a method of discretizing a feature is provided. Fig. 1 is a schematic diagram of a main flow of a feature discretization method according to an embodiment of the present invention, and as shown in fig. 1, the feature discretization method according to an embodiment of the present invention includes: step S101, step S102, and step S103.
In step S101, the samples are sorted according to the values of the feature. Generally, the sorting is in ascending or descending order; in practice, other orderings can be chosen according to the actual situation.
In step S102, each demarcation point of the feature is determined according to the information gain at each value point of the feature, so as to discretize the feature into several discrete intervals.
Assume that a sample set $X$ contains $n_1$ positive samples and $n_0$ negative samples, in proportions

$$p_1 = \frac{n_1}{n_0 + n_1}, \qquad p_0 = \frac{n_0}{n_0 + n_1}.$$

The entropy of this sample set is

$$H(X) = -p_1 \log p_1 - p_0 \log p_0.$$

Suppose an operation $A$ divides the sample set $X$ into $m$ parts, each part denoted $X_i$. The entropy corresponding to this split operation is

$$H(X \mid A) = \sum_{i=1}^{m} \frac{|X_i|}{|X|} H(X_i).$$
Finally, the information gain of the split operation $A$ is defined as $H(X) - H(X \mid A)$. Under this definition, the more information a value point carries about which interval a sample belongs to, the larger the corresponding information gain. Discretizing the feature into several discrete intervals according to the information gain at each value point of the feature thus performs an initial discretization of the independent-variable values using the supervised information carried by the sample labels.
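The entropy and information-gain definitions above can be sketched as follows. This is a minimal Python illustration over a list of labels sorted by feature value (natural logarithm; function names are not from the patent):

```python
import math

def entropy(labels):
    """H(X) = -sum over classes of p * log(p), with p the class proportion."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def information_gain(labels, split_index):
    """Gain H(X) - H(X|A) of splitting the sorted labels before split_index."""
    left, right = labels[:split_index], labels[split_index:]
    n = len(labels)
    conditional = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - conditional
```

For the sorted labels [0, 0, 1, 1], splitting between the two classes (index 2) yields the maximum gain, consistent with the observation below that the best split lies at a class boundary.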
In practice, value points with a larger information gain can be selected as demarcation points of the feature values, for example the value point with the maximum information gain, or value points whose information gain is not less than a preset information-gain threshold. Alternatively, the demarcation points between different intervals can be defined as follows: after all samples are sorted by the values of the feature, if two adjacent samples belong to different intervals, the midpoint of the two samples' values of the feature is a demarcation point between the intervals.
In information-theoretic terms, the purpose of discretization is to retain as much information as possible; therefore, a continuous feature can be discretized by splitting at the value points that yield the maximum information gain. In some embodiments, determining each demarcation point of the feature according to the information gain at each value point of the feature comprises: Step A: traverse all value points of the feature, compute the information gain, determine a demarcation point at the value point with the maximum information gain so as to divide the sample into two parts, and then perform Step B; Step B: apply the operation of Step A to each of the two parts in turn, until a stopping condition is reached, to obtain all demarcation points of the feature. By the nature of information gain, the point that yields the maximum information gain must lie at a boundary between samples of different classes. In view of this, the demarcation points can be determined directly from the sample label at each value point. Specifically, determining each demarcation point of the feature according to the information gain at each value point of the feature comprises: searching, according to the sample label at each value point of the feature, for pairs of adjacent value points whose sample labels differ; and taking the midpoint of the values of each such adjacent pair as a demarcation point. Illustratively, after classifying customers with the sample labels "churned" and "not churned", the samples are sorted by the value points of feature T; if the sample label of the x-th value point is "churned" and the sample label of the (x+1)-th value point is "not churned", then the midpoint between the value of feature T at the x-th value point and its value at the (x+1)-th value point is a demarcation point.
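A minimal sketch of this label-change rule, assuming the samples are given as (value, label) pairs; the guard skipping pairs with equal feature values is an added assumption for handling ties, not from the patent:

```python
def demarcation_points(samples):
    """samples: iterable of (value, label) pairs.
    Returns the midpoints between adjacent sorted samples whose labels
    differ; these are the candidate demarcation points."""
    ordered = sorted(samples)  # sort by feature value
    cuts = []
    for (v1, y1), (v2, y2) in zip(ordered, ordered[1:]):
        if y1 != y2 and v1 != v2:  # label changes across distinct values
            cuts.append((v1 + v2) / 2)
    return cuts
```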
Determining the demarcation points from the value points with the maximum information gain makes each resulting discrete interval purer, so that it better describes the physical meaning of the feature.
In this step, the values of the feature are preliminarily discretized into several discrete intervals; the number of intervals can be chosen according to the actual situation, and the invention is not particularly limited in this respect.
In step S103, the several discrete intervals are cluster-merged to obtain the discretization result. Through cluster merging, the result of the initial discretization can be further optimized by combining unsupervised information beyond the sample labels, namely the distribution of the feature values.
For cluster merging, a person skilled in the art can select a suitable clustering method according to the actual situation, for example K-means or K-medoids. Taking hierarchical clustering as an example, the procedure is as follows: adopt an iterative strategy in which each iteration fuses the two closest value intervals into one interval, then finds the two closest intervals among the updated value intervals and fuses them; repeat until the number of value intervals meets the specified requirement. The distance between two value intervals is defined as the difference between the maximum and the minimum of their union: if the values contained in the two intervals are the sets $S_1$ and $S_2$, the distance is $\max\{S_1 \cup S_2\} - \min\{S_1 \cup S_2\}$. In ordinary hierarchical clustering, the distance between every pair of value intervals must be compared to find the closest two; with $n$ value intervals this requires $\binom{n}{2} = \frac{n(n-1)}{2}$ comparisons.
For a continuous feature, the value interval closest to a given value interval must be adjacent to it. Therefore, in an optional embodiment, cluster-merging the several discrete intervals comprises: comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals; and comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent discrete intervals with the minimum distance into one interval, and repeating these steps until the number of updated discrete intervals equals the preset number of intervals. In this embodiment only the distances between adjacent value intervals are compared: with the iterative strategy of fusing the two closest discrete intervals at each step, only $n-1$ comparisons are needed when there are $n$ value intervals, which greatly reduces the complexity of the clustering algorithm and improves operating efficiency.
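The adjacent-interval fusion described above can be sketched as follows (illustrative stdlib Python; each interval is represented as the list of values it contains, with the intervals ordered along the feature axis, so only the $n-1$ adjacent pairs are compared per iteration):

```python
def merge_adjacent(intervals, target):
    """intervals: list of value lists, ordered along the feature axis.
    Repeatedly fuse the adjacent pair with the minimum distance
    max(S1 ∪ S2) - min(S1 ∪ S2) until `target` intervals remain."""
    intervals = [list(s) for s in intervals]
    while len(intervals) > target:
        # Since intervals are ordered, max of the union is in the right
        # interval and min of the union is in the left interval.
        dists = [max(intervals[i + 1]) - min(intervals[i])
                 for i in range(len(intervals) - 1)]
        i = dists.index(min(dists))
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
    return intervals
```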
The preset number of intervals can be set according to the actual situation. Optionally, it is determined according to the following steps: comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals; comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the closest pair into one interval, and repeating these steps until all the discrete intervals are fused into a single interval, to obtain fusion data, which comprises the distance between the two adjacent discrete intervals fused at each step and the number of discrete intervals after each update; and fitting a curve of the distance with respect to the number of discrete intervals from the fusion data, then determining the preset number of intervals using the elbow method. Illustratively, take a candidate number $c$ of discrete value intervals, and use linear regression to fit the $[1, c]$ part and the $[c+1, N]$ part of the distance curve separately, where $N$ is the maximum possible number of discrete value intervals. As shown in FIG. 2 (in the figure, Distance denotes the fusion distance and Num. Clusters denotes the number of intervals), let the root mean square errors between the left fitted line (the line labelled "fit left") and the right fitted line (the line labelled "fit right") and the corresponding parts of the distance curve be $e_l$ and $e_r$. The fitting error corresponding to $c$ discrete value intervals is then

$$e(c) = e_l + e_r.$$

For two vectors $a$ and $b$ of length $N$, the root mean square error is defined as

$$\mathrm{RMSE}(a, b) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (a_i - b_i)^2}.$$

Among all values of $c$, the one that minimizes the fitting error,

$$c^* = \operatorname*{argmin}_{c} \, e(c),$$

is taken as the preset number of intervals, and the corresponding feature discretization result is selected. Determining the preset number of intervals by the elbow method reduces the error of the feature discretization while ensuring the purity of each discrete interval.
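A sketch of the elbow selection under stated assumptions: for each candidate $c$ the distance curve is split into two parts, each fitted by least squares, and the two RMSEs are combined by simple addition, which is one plausible reading of the fitting-error formula rather than the patent's exact definition; all function names are illustrative.

```python
def _fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    denom = sum((x - mx) ** 2 for x in xs)
    slope = 0.0 if denom == 0 else sum(
        (x - mx) * (y - my) for x, y in zip(xs, ys)) / denom
    return slope, my - slope * mx

def _rmse(xs, ys, slope, intercept):
    """Root mean square error of the line against the points."""
    return (sum((y - (slope * x + intercept)) ** 2
                for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def elbow_point(distances):
    """distances: the distance curve indexed by the number of intervals
    (position i holds the curve value at i+1 intervals). For each candidate
    c, fit lines to the [1, c] and [c+1, N] parts and combine their RMSEs;
    return the c with the smallest fitting error."""
    N = len(distances)
    xs = list(range(1, N + 1))
    best_c, best_err = None, float('inf')
    for c in range(2, N - 1):
        el = _rmse(xs[:c], distances[:c], *_fit_line(xs[:c], distances[:c]))
        er = _rmse(xs[c:], distances[c:], *_fit_line(xs[c:], distances[c:]))
        err = el + er  # one plausible combination of the two errors
        if err < best_err:
            best_c, best_err = c, err
    return best_c
```

For a curve that is steeply linear up to 5 intervals and flat afterwards, the minimum fitting error is reached at $c = 5$.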
By discretizing continuous features, the invention not only effectively optimizes storage, eliminates the interference of outliers, and introduces nonlinear characteristics, but also enables some of the processed features to represent properties of the independent variable. For example, in customer churn prediction modeling, independent variables such as a customer's order volume are generally considered. Even under normal conditions, a customer's order volume fluctuates somewhat over time. Substituting the raw order volume directly into the model easily introduces interference and makes the model overly sensitive, whereas discretizing the order volume into several intervals effectively eliminates this interference: only a jump between intervals indicates a change of state, while fluctuation within the same interval falls in the normal range, making the model more robust. On the other hand, many feature values have a stronger physical meaning after discretization than before. For example, the order volume can be divided into high and low grades and further subdivided by customer type (easily churned or not easily churned) within each grade: a customer with a high order volume who is not easily churned is a class A customer, a customer with a low order volume who is easily churned is a class B customer, and a customer with a high order volume who is easily churned is a class C customer. Modeling after discretizing the order-volume feature makes the features more representative and improves the effect of the model.
The feature discretization method of the present invention is illustrated below through its application to customer churn prediction. In this example, the application process is as follows:
(1) For the merchant customers using J company's logistics, extract the data generated by J company's logistics service since January 1, 2018, including the waybill dimension (total number of waybills, number of waybills using J company's logistics, number of waybills using other logistics companies, change in waybill count, volume, weight, freight charges, etc.), the logistics-service dimension (on-time rate of pickup and receipt for the merchant, number of complaints, fulfilment rate, etc.), and the merchant's own dimension (the merchant's main category, number of SKUs, rating, complaints, average order value, duration of the logistics cooperation with J company, etc.). Data that change over time are aggregated in units of weeks;
(2) select samples: for each merchant, take that merchant's data over a preceding time window (e.g., 10 weeks) in each week; the combination of one merchant and one time window constitutes one sample;
(3) construct the sample labels: according to the business definition, if a merchant does not use J company's logistics within three months after the last week of the time window, the merchant is considered churned in that week; otherwise, not churned;
(4) process the independent variables and derive statistics; for example, from the order volume one can derive the order volume over the past two weeks, the order volume over the past three weeks, and so on;
(5) select the continuous independent variables and discretize them using the method of this aspect, with the following specific steps:
(a) arrange the samples in ascending order of the values of the independent variable;
(b) select the boundary points between different classes as split points to obtain the initial discrete value intervals;
(c) using the improved hierarchical clustering method, fuse the discrete value intervals step by step until they are fused into a single interval, recording the fusion distance at each step to form a curve of fusion distance versus the number of discrete value intervals;
(d) find the number of discrete value intervals using the elbow method: given a candidate number of discrete value intervals, divide the curve of distance versus the number of discrete value intervals into a left part and a right part at that number, fit each part with linear regression, and compute the fitting error; finally, select the number of discrete value intervals with the minimum fitting error;
(e) discretize the continuous feature according to the result corresponding to the final number of discrete value intervals.
(6) Using the features derived from the discretization of the continuous independent variables together with the other features, build a logistic regression model to predict whether merchant customers will churn in the future.
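Steps (a) through (e) can be strung together in a compact stdlib-Python sketch. For brevity it picks the interval count at the largest jump in fusion distance, a crude stand-in for the elbow fit described above; it assumes at least three initial intervals, and all names are illustrative:

```python
def discretize(samples):
    """End-to-end sketch of steps (a)-(e) for one continuous feature.
    samples: (value, label) pairs. Returns the chosen cut points."""
    ordered = sorted(samples)                  # (a) sort by feature value
    intervals = [[ordered[0][0]]]              # (b) split where labels change
    for (v1, y1), (v2, y2) in zip(ordered, ordered[1:]):
        if y1 != y2:
            intervals.append([])
        intervals[-1].append(v2)
    trace = []  # (fusion distance, intervals before this fusion)
    while len(intervals) > 1:                  # (c) fuse adjacent intervals
        dists = [max(intervals[i + 1]) - min(intervals[i])
                 for i in range(len(intervals) - 1)]
        i = dists.index(min(dists))
        trace.append((dists[i], [list(s) for s in intervals]))
        intervals[i:i + 2] = [intervals[i] + intervals[i + 1]]
    # (d) stop just before the largest jump in fusion distance
    jumps = [trace[k + 1][0] - trace[k][0] for k in range(len(trace) - 1)]
    chosen = trace[jumps.index(max(jumps)) + 1][1]
    # (e) cut points: midpoints between consecutive retained intervals
    return [(max(a) + min(b)) / 2 for a, b in zip(chosen, chosen[1:])]
```

The resulting cut points would then feed the modeling step, e.g. as one-hot interval indicators for the logistic regression of step (6).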
According to a second aspect of embodiments of the present invention, there is provided a feature discretization apparatus. Fig. 3 is a schematic diagram of main blocks of a feature discretization apparatus according to an embodiment of the present invention, and as shown in fig. 3, a feature discretization apparatus 300 according to an embodiment of the present invention includes:
a preprocessing module 301, which sorts the samples according to the values of the feature;
a discretization module 302, configured to determine each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into several discrete intervals;
and a fusion module 303, configured to cluster-merge the several discrete intervals to obtain the discretization result.
Optionally, the determining, by the discretization module, of each demarcation point of the feature according to the information gain at each value point of the feature comprises:
searching, according to the sample label at each value point of the feature, for pairs of adjacent value points whose sample labels differ; and taking the midpoint of the values of each such adjacent pair as a demarcation point.
Optionally, the cluster-merging of the several discrete intervals by the fusion module comprises:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals;
and comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent discrete intervals with the minimum distance into one interval, and repeating these steps until the number of updated discrete intervals equals the preset number of intervals.
Optionally, the fusion module is further configured to determine the preset number of intervals according to the following steps:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval, to obtain updated discrete intervals; comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent discrete intervals with the minimum distance into one interval, and repeating these steps until all the discrete intervals are fused into a single interval, to obtain fusion data; the fusion data comprises the distance between the two adjacent discrete intervals fused at each step and the number of discrete intervals after each update;
and fitting a curve of the distance with respect to the number of discrete intervals from the fusion data, and then determining the preset number of intervals using the elbow method.
According to a third aspect of embodiments of the present invention, there is provided a feature discretization electronic device including:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the feature discretization method provided in the first aspect of the embodiments of the present invention.
According to a fourth aspect of the embodiments of the present invention, there is provided a computer-readable medium on which a computer program is stored, the program, when executed by a processor, implementing the method for discretizing features provided by the first aspect of the embodiments of the present invention.
Fig. 4 illustrates an exemplary system architecture 400 to which the feature discretization method or feature discretization apparatus of embodiments of the invention can be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. The network 404 may include various types of connections, such as wired links, wireless communication links, or fiber-optic cables.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the feature discretization method provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the feature discretization apparatus is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker; a storage portion 508 including a hard disk and the like; and a communication portion 509 including a network interface card such as a LAN card or a modem. The communication portion 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read therefrom is installed into the storage portion 508 as needed.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising: a preprocessing module for sorting the samples according to the values of the feature; a discrete module for determining each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into a plurality of discrete intervals; and a fusion module for clustering and fusing the discrete intervals to obtain a discretization result. The names of these modules do not, in some cases, limit the modules themselves; for example, the fusion module may also be described as "a module for clustering and fusing the discrete intervals".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: sort the samples according to the values of the feature; determine each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into a plurality of discrete intervals; and cluster and fuse the discrete intervals to obtain a discretization result.
According to the technical scheme of the embodiments of the present invention, the distribution of the feature values (unsupervised information) is fully combined with the sample labels (supervised information), and the characteristics implied by both types of information are mined, forming a semi-supervised discretization method for continuous features.
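The three steps — sorting, supervised splitting at label changes, and unsupervised fusion of close intervals — can be sketched end to end. Everything here is illustrative: the function names, the list-based data layout, and the midpoint-gap distance measure are the sketch's own assumptions, not the patented code.

```python
# End-to-end illustrative sketch of the semi-supervised discretization:
# sort, split at label changes (supervised), fuse close intervals
# (unsupervised). All names and the distance measure are assumptions.
def discretize(values, labels, n_preset):
    order = sorted(range(len(values)), key=lambda i: values[i])
    vs = [values[i] for i in order]          # step 1: sort by feature value
    ls = [labels[i] for i in order]
    # step 2: demarcation points at midpoints where adjacent labels differ
    cuts = [(vs[i] + vs[i + 1]) / 2.0 for i in range(len(vs) - 1)
            if ls[i] != ls[i + 1] and vs[i] != vs[i + 1]]
    edges = [vs[0]] + cuts + [vs[-1]]
    intervals = list(zip(edges[:-1], edges[1:]))
    # step 3: fuse the adjacent pair with the smallest midpoint gap
    # until n_preset intervals remain
    while len(intervals) > n_preset:
        mids = [(lo + hi) / 2.0 for lo, hi in intervals]
        gaps = [mids[i + 1] - mids[i] for i in range(len(mids) - 1)]
        i = gaps.index(min(gaps))
        intervals[i:i + 2] = [(intervals[i][0], intervals[i + 1][1])]
    return intervals
```

For instance, values `[0.0, 0.5, 1.0, 1.5, 4.0, 4.5]` with labels `[0, 0, 1, 1, 0, 0]` first split into three intervals at cuts 0.75 and 2.75; with a preset count of 2, the two closest intervals are then fused.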
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A feature discretization method, comprising:
sorting the samples according to the values of the feature;
determining each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into a plurality of discrete intervals;
and clustering and fusing the discrete intervals to obtain a discretization result.
2. The feature discretization method of claim 1, wherein determining each demarcation point of the feature according to the information gain at each value point of the feature comprises:
searching, according to the sample label of each feature point, for pairs of adjacent value points whose sample labels differ; and taking the midpoint between the values of each such adjacent pair as a demarcation point.
3. The feature discretization method of claim 1, wherein clustering and fusing the plurality of discrete intervals comprises:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval to obtain updated discrete intervals;
and comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent intervals with the minimum distance into one interval, and repeating these steps until the number of updated discrete intervals equals the preset number of intervals.
4. The feature discretization method of claim 3, wherein the preset number of intervals is determined according to the following steps:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent intervals with the minimum distance into one interval to obtain updated discrete intervals; comparing the distances between every two adjacent intervals among the updated discrete intervals, fusing the two adjacent intervals with the minimum distance into one interval, and repeating these steps until all the discrete intervals have been fused into a single interval, thereby obtaining fusion data; the fusion data includes, for each fusion step, the distance between the two adjacent discrete intervals fused and the number of discrete intervals after the update;
and fitting a curve of the distance with respect to the number of discrete intervals from the fusion data, and then determining the preset number of intervals using the elbow method.
5. A feature discretization apparatus, comprising:
a preprocessing module for sorting the samples according to the values of the feature;
a discrete module for determining each demarcation point of the feature according to the information gain at each value point of the feature, so as to discretize the feature into a plurality of discrete intervals;
and a fusion module for clustering and fusing the discrete intervals to obtain a discretization result.
6. The feature discretization apparatus of claim 5, wherein the determining, by the discrete module, of each demarcation point of the feature according to the information gain at each value point of the feature comprises:
searching, according to the sample label of each feature point, for pairs of adjacent value points whose sample labels differ; and taking the midpoint between the values of each such adjacent pair as a demarcation point.
7. The feature discretization apparatus of claim 5, wherein the clustering and fusing of the plurality of discrete intervals by the fusion module comprises:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent discrete intervals with the minimum distance into one interval to obtain updated discrete intervals;
and comparing the distances between every two adjacent discrete intervals among the updated discrete intervals, fusing the two adjacent intervals with the minimum distance into one interval, and repeating these steps until the number of updated discrete intervals equals the preset number of intervals.
8. The feature discretization apparatus of claim 7, wherein the fusion module is further configured to determine the preset number of intervals according to the following steps:
comparing the distances between every two adjacent discrete intervals, and fusing the two adjacent intervals with the minimum distance into one interval to obtain updated discrete intervals; comparing the distances between every two adjacent intervals among the updated discrete intervals, fusing the two adjacent intervals with the minimum distance into one interval, and repeating these steps until all the discrete intervals have been fused into a single interval, thereby obtaining fusion data; the fusion data includes, for each fusion step, the distance between the two adjacent discrete intervals fused and the number of discrete intervals after the update;
and fitting a curve of the distance with respect to the number of discrete intervals from the fusion data, and then determining the preset number of intervals using the elbow method.
9. A feature discretization electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN201910779357.7A 2019-08-22 2019-08-22 Feature discretization method and device Pending CN112418258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910779357.7A CN112418258A (en) 2019-08-22 2019-08-22 Feature discretization method and device

Publications (1)

Publication Number Publication Date
CN112418258A true CN112418258A (en) 2021-02-26

Family

ID=74778948

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910779357.7A Pending CN112418258A (en) 2019-08-22 2019-08-22 Feature discretization method and device

Country Status (1)

Country Link
CN (1) CN112418258A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100088315A1 (en) * 2008-10-05 2010-04-08 Microsoft Corporation Efficient large-scale filtering and/or sorting for querying of column based data encoded structures
CN107562698A (en) * 2017-08-03 2018-01-09 北京京东尚科信息技术有限公司 A kind of optimization method and device of sample value interval model
CN108170837A (en) * 2018-01-12 2018-06-15 平安科技(深圳)有限公司 Method of Data Discretization, device, computer equipment and storage medium
CN108846259A (en) * 2018-04-26 2018-11-20 河南师范大学 A kind of gene sorting method and system based on cluster and random forests algorithm
CN109858785A (en) * 2019-01-16 2019-06-07 中国电力科学研究院有限公司 A kind of method and system for evaluating intelligent electric energy meter operating status
WO2019114423A1 (en) * 2017-12-15 2019-06-20 阿里巴巴集团控股有限公司 Method and apparatus for merging model prediction values, and device
CN109933619A (en) * 2019-03-13 2019-06-25 西南交通大学 A kind of semisupervised classification prediction technique

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Jun, Liu Yan, Gu Xueping: "Application of an information-entropy-based attribute discretization algorithm in transient stability assessment", Automation of Electric Power Systems, no. 08, 15 August 2005 (2005-08-15) *
Xu Zhixing et al.: "An automatic clustering algorithm for continuous attributes based on information entropy", Journal of Nanjing University of Aeronautics and Astronautics, 30 June 2001 (2001-06-30), pages 2 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113222651A (en) * 2021-04-29 2021-08-06 西安点告网络科技有限公司 Advertisement putting model statistical class characteristic discretization method, system, equipment and medium
CN113742801A (en) * 2021-08-31 2021-12-03 阳光新能源开发有限公司 Road data processing method and device and electronic equipment
CN114297454A (en) * 2021-12-30 2022-04-08 医渡云(北京)技术有限公司 Method and device for discretizing features, electronic equipment and computer readable medium
CN114297454B (en) * 2021-12-30 2023-01-03 医渡云(北京)技术有限公司 Method and device for discretizing features, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
CN108536650B (en) Method and device for generating gradient lifting tree model
CN110555640B (en) Route planning method and device
CN110751497A (en) Commodity replenishment method and device
CN110348771B (en) Method and device for order grouping of orders
CN112418258A (en) Feature discretization method and device
CN111160847B (en) Method and device for processing flow information
CN111767455A (en) Information pushing method and device
CN110866625A (en) Promotion index information generation method and device
CN113743971A (en) Data processing method and device
CN112784212A (en) Method and device for optimizing inventory
CN110766431A (en) Method and device for judging whether user is sensitive to coupon
CN112231299B (en) Method and device for dynamically adjusting feature library
CN112785213B (en) Warehouse manifest picking construction method and device
CN112990311A (en) Method and device for identifying admitted client
CN112906723A (en) Feature selection method and device
CN114092194A (en) Product recommendation method, device, medium and equipment
CN114677174A (en) Method and device for calculating sales volume of unladen articles
CN113780333A (en) User group classification method and device
CN112528103A (en) Method and device for recommending objects
CN112100291A (en) Data binning method and device
CN113379173A (en) Method and apparatus for labeling warehouse goods
CN110895564A (en) Potential customer data processing method and device
CN112862554A (en) Order data processing method and device
CN112667770A (en) Method and device for classifying articles
CN112734352A (en) Document auditing method and device based on data dimensionality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination