CN107644102B - Data feature construction method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN107644102B
CN107644102B (application CN201710954269.7A)
Authority
CN
China
Prior art keywords: features, characteristic value, data, target data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710954269.7A
Other languages
Chinese (zh)
Other versions
CN107644102A (en)
Inventor
Wang Shuo (王硕)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd, Beijing Jingdong Shangke Information Technology Co Ltd filed Critical Beijing Jingdong Century Trading Co Ltd
Priority to CN201710954269.7A priority Critical patent/CN107644102B/en
Publication of CN107644102A publication Critical patent/CN107644102A/en
Application granted granted Critical
Publication of CN107644102B publication Critical patent/CN107644102B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to a data feature construction method and device in the technical field of data processing. The method includes the following steps: acquiring a plurality of original data features and cleaning each original data feature to obtain a plurality of first target data features; and fusing the first target data features to obtain a plurality of second target data features. By fusing the first target data features into second target data features, hidden information in the data features can be mined, so that associations among the data features can be further extracted from that hidden information, and a model trained and used for prediction with these data features achieves higher accuracy.

Description

Data feature construction method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data feature construction method, a data feature construction device, a computer-readable storage medium, and an electronic device.
Background
In a data mining project, some user data features that are sparse need to be extracted for feature mining. Sparse features may include, for example, time-window features such as the user's browsing, add-to-cart, and search behavior over the last 1 to 5 days.
However, after data statistics, the feature values of such data features are mostly zero. In general, the selection of feature values for a data feature needs to satisfy three elements: being informative (rich in information), discriminative, and independent; when the data features are too sparse, their feature values cannot satisfy these three elements.
Generally, the model can produce good output only when features with strong informativeness and discriminative power are selected, while randomly deleting a user's sparse features loses much hidden important information and thus introduces larger errors into the data mining results.
Therefore, it is desirable to provide a new data feature construction method and apparatus.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide a data feature construction method, a data feature construction apparatus, a computer-readable storage medium, and an electronic device, thereby overcoming, at least to some extent, one or more of the problems due to the limitations and disadvantages of the related art.
According to an aspect of the present disclosure, there is provided a data feature construction method including:
acquiring a plurality of original data characteristics and cleaning each original data characteristic to obtain a plurality of first target data characteristics;
and fusing the first target data features to obtain a plurality of second target data features.
In an exemplary embodiment of the disclosure, cleaning each of the raw data features to obtain a plurality of first target data features includes:
removing abnormal features from the original data features to obtain a plurality of first target data features.
In an exemplary embodiment of the present disclosure, removing the abnormal feature values from each of the original data features includes:
determining whether the feature value of each original data feature is smaller than a first preset value;
when the feature value of an original data feature is determined to be smaller than the first preset value, removing the original data feature corresponding to that feature value; and
when the feature value of an original data feature is determined to be not smaller than the first preset value, determining whether the feature value is greater than a second preset value;
and when the feature value of an original data feature is determined to be greater than the second preset value, removing the original data feature corresponding to that feature value.
In an exemplary embodiment of the disclosure, fusing each of the first target data features to obtain a plurality of second target data features includes:
constructing a plurality of characteristic indexes;
calculating each feature index of each first target data feature;
and fusing the feature indexes of the first target data features to obtain a plurality of second target data features.
In an exemplary embodiment of the present disclosure, the feature indices include a plurality of the following: the mean, the median, the upper quartile, the standard deviation, and the maximum value.
In an exemplary embodiment of the present disclosure, the raw data features include a plurality of the following: purchase features, browsing features, add-to-cart features, search features, and coupon features.
According to an aspect of the present disclosure, there is provided a data feature construction apparatus including:
the data characteristic cleaning module is used for acquiring a plurality of original data characteristics and cleaning each original data characteristic to obtain a plurality of first target data characteristics;
and the data feature fusion module is used for fusing the first target data features to obtain a plurality of second target data features.
In an exemplary embodiment of the disclosure, cleaning each of the raw data features to obtain a plurality of first target data features includes:
removing abnormal features from the original data features to obtain a plurality of first target data features.
In an exemplary embodiment of the present disclosure, removing the abnormal feature values from each of the original data features includes:
determining whether the feature value of each original data feature is smaller than a first preset value;
when the feature value of an original data feature is determined to be smaller than the first preset value, removing the original data feature corresponding to that feature value; and
when the feature value of an original data feature is determined to be not smaller than the first preset value, determining whether the feature value is greater than a second preset value;
and when the feature value of an original data feature is determined to be greater than the second preset value, removing the original data feature corresponding to that feature value.
In an exemplary embodiment of the disclosure, fusing each of the first target data features to obtain a plurality of second target data features includes:
constructing a plurality of characteristic indexes;
calculating each feature index of each first target data feature;
and fusing the feature indexes of the first target data features to obtain a plurality of second target data features.
In an exemplary embodiment of the present disclosure, the feature indices include a plurality of the following: the mean, the median, the upper quartile, the standard deviation, and the maximum value.
In an exemplary embodiment of the present disclosure, the raw data features include a plurality of the following: purchase features, browsing features, add-to-cart features, search features, and coupon features.
According to an aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data feature construction method as recited in any one of the above.
According to an aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform any of the data feature construction methods described above via execution of the executable instructions.
The present disclosure provides a data feature construction method and device, in which a plurality of original data features are acquired and cleaned to obtain a plurality of first target data features, and the first target data features are then fused to obtain a plurality of second target data features. On the one hand, cleaning each original data feature removes abnormal data features from the original data features, so that the constructed data features can satisfy each element of feature selection (being informative, discriminative, and independent), improving the accuracy of the data features. On the other hand, fusing the first target data features into a plurality of second target data features mines the hidden information in the data features, so that associations among the data features can be further extracted from that hidden information, and a model trained and used for prediction with these data features achieves higher accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 schematically illustrates a flow chart of a data feature construction method.
Fig. 2 schematically shows a flowchart of a method for culling abnormal feature values in original data features.
Fig. 3 schematically shows an illustration of a box plot.
Fig. 4 schematically shows a flow chart of a method of fusing first target data.
Fig. 5 schematically shows a block diagram of a data feature construction apparatus.
Fig. 6 schematically shows an electronic device for implementing the above-described data feature construction method.
Fig. 7 schematically illustrates a computer-readable storage medium for implementing the above-described data feature construction method.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present disclosure.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The present exemplary embodiment first provides a data feature construction method. Referring to fig. 1, the data feature construction method may include the steps of:
step S110, a plurality of original data features are obtained, and each original data feature is cleaned to obtain a plurality of first target data features.
And S120, fusing the first target data characteristics to obtain a plurality of second target data characteristics.
In this data feature construction method, on the one hand, cleaning each original data feature removes abnormal data features from the original data features, so that the constructed data features can satisfy each element of feature selection (being informative, discriminative, and independent), improving the accuracy of the data features; on the other hand, fusing the first target data features into a plurality of second target data features mines the hidden information in the data features, so that associations among the data features can be further extracted from that hidden information, and a model trained and used for prediction with these data features achieves higher accuracy.
Hereinafter, each step in the above-described data feature construction method in the present exemplary embodiment will be explained and explained in detail.
In step S110, a plurality of original data features are obtained and each of the original data features is cleaned to obtain a plurality of first target data features.
First, the above original data features are explained. Raw data features may include purchase features, browsing features, add-to-cart features, search features, and coupon features, among others; other data features, such as touch features, may also be included, which this example does not limit. Wherein:
the purchase features may be the number of SKUs purchased by the user in the last year, the average order value per customer, the order amount in the last month, the order amount in the last three months, the order amount in the last three to six months, and so on; other features may also be included, such as the order volume in the last year, which this example does not specifically limit;
the browsing features may be the number of SKUs browsed and the number of browsing sessions in the last 1, 2, 3, 4, and 5 days, and the number of browsing sessions in the last 7 days, 7-15 days, 15-21 days, and 21-28 days, etc.; other features may also be included, such as the number of SKUs browsed in the last 7 days, which this example does not specifically limit;
the add-to-cart features may be the number of SKUs, third-level categories, and brands added to the cart by the user in the last 1-3 days, 1-5 days, 1-7 days, 7-15 days, and 15-30 days; other features may also be included, such as the number of SKUs, third-level categories, and brands added to the cart within 2 months, which this example does not specifically limit;
the search features may be the number of keyword searches, the number of brands searched, the number of first-level categories searched, and the number of third-level categories searched in the user's last 10, 20, and 30 days, or other features, such as the same quantities over 2 months, which this example does not specifically limit;
the coupon features may be the number of coupons used, the number of orders purchased with coupons, the coupon amount used, the discount amount, and the third-level categories purchased in the user's last 3 months, 3-6 months, 6-9 months, and 9-12 months, or may include other features, such as the number of SKUs purchased, which this example does not limit.
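As a concrete illustration, one user's raw features across the five families above might be laid out as follows. This is a hypothetical sketch: every field name and value here is an illustrative assumption, not taken from the disclosure.

```python
# Hypothetical raw feature record for one user, grouped by the five
# feature families described above. Field names are illustrative only.
raw_features = {
    "purchase":    {"sku_count_1y": 14, "avg_order_value": 86.5, "orders_1m": 2},
    "browse":      {"sku_browsed_5d": 0, "visits_7d": 3},
    "add_to_cart": {"sku_added_5d": 1, "brands_added_7d": 1},
    "search":      {"keyword_searches_10d": 0, "brands_searched_10d": 0},
    "coupon":      {"coupons_used_3m": 0, "coupon_amount_3m": 0.0},
}

def sparsity(record):
    """Fraction of feature values equal to zero -- the sparsity problem
    that motivates the construction method."""
    values = [v for group in record.values() for v in group.values()]
    return sum(1 for v in values if v == 0) / len(values)
```

For a low-activity user such as this one, `sparsity(raw_features)` is close to half, which illustrates why the feature values mostly fail the informative/discriminative/independent test.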
Next, step S110 is further explained based on the above raw data features. First, the purchase features, browsing features, add-to-cart features, search features, and coupon features of all users are acquired; then, cleaning the raw data features may specifically include: removing abnormal features from the original data features to obtain a plurality of first target data features. It should be added that before removing the abnormal features, the risky users and the billing users may be removed from the original data features to obtain original data features in a normal user state. Further, referring to fig. 2, removing the abnormal feature values from the original data features may include steps S210 to S240. Wherein:
in step S210, it is determined whether the feature value of each of the original data features is smaller than a first preset value.
First, the box plot (box-and-whisker plot) is explained. The box plot provides a criterion for identifying outliers, based on quartiles, which are briefly introduced here. Referring to fig. 3, the quartiles are obtained by arranging all values from smallest to largest and dividing them into four equal parts; the values at the three division points are called the quartiles. The first quartile (Q1), also called the lower quartile, is the value at the 25% position after all values in the sample are arranged from smallest to largest; the second quartile (Q2), also called the median, is the value at the 50% position; the third quartile (Q3), also called the upper quartile, is the value at the 75% position. The difference between the third quartile and the first quartile is called the interquartile range (IQR). An outlier may then be defined as any value in a batch of data less than Q1-1.5IQR or greater than Q3+1.5IQR.
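The box-plot criterion above can be sketched in a few lines of Python. The "inclusive" interpolation method is an assumption on our part; the disclosure does not fix a quantile convention, and other conventions shift the thresholds slightly.

```python
import statistics

def iqr_bounds(values):
    """Return the box-plot outlier thresholds (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    # quantiles(n=4) yields the three quartiles; "inclusive" interpolates
    # linearly between sorted sample points.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]
low, high = iqr_bounds(data)                          # (-4.0, 16.0) here
outliers = [v for v in data if v < low or v > high]   # [100]
```

Here Q1 = 3.5 and Q3 = 8.5, so IQR = 5.0 and the batch's single extreme value, 100, falls outside the upper threshold.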
Next, step S210 is explained and explained based on the above-mentioned quartile. Firstly, judging whether the characteristic value of each original data characteristic is smaller than a first preset value. For example:
for example, if the feature value of the purchase feature is 20 and the first predetermined value Q1-1.5IQR is 35 in the original data feature, it may be determined that the feature value of the original data feature is smaller than the first predetermined value. It should be added here that, since the box diagram is used to determine the abnormal value, the size of the first preset value is established depending on the box diagram; when the abnormal value is determined by other methods, the decision rule of the first preset value is changed accordingly, which is not particularly limited in this example.
In step S220, when the feature value of each of the original data features is determined to be smaller than the first preset value, the original data feature corresponding to the feature value is removed. In detail:
Among the original data features, the feature value of the purchase feature is 20 and the first preset value Q1-1.5IQR is 35, so the feature value can be determined to be smaller than the first preset value, and the data feature corresponding to that feature value can be deleted. It should be added that since each original data feature includes a plurality of sub-features, to improve calculation accuracy the feature value of each sub-feature may be checked and the sub-features smaller than the first preset value deleted, which this example does not limit.
In step S230, when the feature value of each of the original data features is determined to be not less than the first preset value, it is determined whether the feature value of each of the original data features is greater than a second preset value. In detail:
When the feature value of an original data feature is determined to be not smaller than the first preset value (Q1-1.5IQR), it must also be determined whether the feature value is greater than the second preset value (Q3+1.5IQR). For example, if the feature value of a browsing feature is 2000 and the second preset value Q3+1.5IQR is 1800, it may be determined that this feature value is greater than the second preset value. It should be added that since the box plot is used here to determine abnormal values, the second preset value is also established based on the box plot; when abnormal values are determined by other methods, the decision rule for the second preset value changes accordingly, which this example does not specifically limit.
In step S240, when the feature value of each of the original data features is determined to be greater than the second preset value, the original data feature corresponding to the feature value is removed. In detail:
Among the original data features, the feature value of the browsing feature is 2000 and the second preset value Q3+1.5IQR is 1800, so the feature value can be determined to be greater than the second preset value, and the data feature corresponding to that feature value can be deleted. It should be added that since each original data feature includes a plurality of sub-features, to improve calculation accuracy the feature value of each sub-feature may be checked and the sub-features greater than the second preset value deleted, which this example does not limit.
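Steps S210-S240 amount to a two-sided threshold filter. The sketch below is a minimal illustration, assuming the two preset values have already been derived from the box plot as described above:

```python
def remove_abnormal(feature_values, first_preset, second_preset):
    """Keep only feature values inside [first_preset, second_preset]."""
    cleaned = []
    for value in feature_values:
        if value < first_preset:    # S210/S220: below the lower threshold -> remove
            continue
        if value > second_preset:   # S230/S240: above the upper threshold -> remove
            continue
        cleaned.append(value)
    return cleaned
```

With the text's two examples, `remove_abnormal([20, 40, 50], 35, 1800)` drops the purchase value 20, and `remove_abnormal([2000, 100], 35, 1800)` drops the browsing value 2000.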
In step S120, the first target data features are fused to obtain a plurality of second target data features. As shown in fig. 4, the fusing of the first target data feature may include steps S1202 to S1206. Wherein:
in step S1202, a plurality of feature indexes are constructed.
In the present exemplary embodiment, the feature indices may be a plurality of the following: the mean, the median, the upper quartile, the standard deviation, and the maximum value; other feature indices may also be included, such as the first quartile, which this example does not specifically limit. Wherein:
Mean: the mean represents the central tendency of a data set; it is the sum of all data in the set divided by the number of data points. Median: for a finite data set, the median can be found by sorting all observed values and taking the one in the middle. Upper quartile: when data are described by quartile-based statistical analysis, the quartiles describe the dispersion of skewed data; after the data are arranged from smallest to largest, the value at the lower 1/4 position (i.e., the 25% position) is called the lower quartile, also known as the first quartile, and the value at the upper 1/4 position (i.e., the 75% position) is called the upper quartile, also known as the third quartile. Standard deviation: the standard deviation is the square root of the arithmetic mean of the squared deviations from the mean, i.e., the arithmetic square root of the variance; it reflects the degree of dispersion of a data set.
In step S1204, each feature index of each first target data feature is calculated. For example:
Calculate, over all users on the site, the mean of the number of SKUs added to the cart in the last 5 days, the median of that number, its upper quartile, its standard deviation, and its maximum. It should be added that the present exemplary embodiment uses the added-SKU-count sub-feature of the add-to-cart features in the original data features as an example for description; other sub-features may likewise be used, and are not described again here.
In step S1206, the feature indexes of the first target data features are fused to obtain a plurality of second target data features. In detail:
After the above statistics are computed, the following calculations are performed to fuse the feature indices, where x denotes the number of SKUs a user added to the cart in the last 5 days and the mean, median, upper quartile, standard deviation, and maximum are taken over that count across all users on the site:
A. x divided by the mean;
B. x minus the median;
C. x minus the upper quartile;
D. (x minus the standard deviation) divided by the mean;
E. the maximum divided by x;
After the feature indices are fused, the results of calculations A-E are used as the second target data features and input into a GBDT (Gradient Boosting Decision Tree) algorithm for training; the training results obtained with the second target data features are more accurate than those obtained with the original data features, and the sparsity of the features is largely avoided.
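Fusion calculations A-E can be sketched as follows. The machine translation leaves the exact operation in A ambiguous ("to the average value"); a ratio to the mean is assumed here, and the parameter names are illustrative:

```python
# Hedged sketch of fusion calculations A-E for one user.
# x: the user's SKU count added to the cart in the last 5 days (assumed meaning);
# the remaining statistics are taken over all users on the site.
def fuse(x, mean, median, upper_quartile, std, maximum):
    return {
        "A": x / mean,             # position relative to the population mean (ratio assumed)
        "B": x - median,           # offset from the median
        "C": x - upper_quartile,   # offset from the upper quartile
        "D": (x - std) / mean,     # mean-scaled offset from the standard deviation
        "E": maximum / x,          # gap below the population maximum
    }

second_target = fuse(x=5, mean=4.0, median=3.0, upper_quartile=6.0, std=2.0, maximum=10.0)
# The five values would then be fed to a GBDT model as second target data features.
```

With the hypothetical statistics above, the second target features come out to A = 1.25, B = 2.0, C = -1.0, D = 0.75, E = 2.0.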
The present example embodiment also provides a data feature construction apparatus. Referring to FIG. 5, the data feature construction apparatus may include a data feature cleaning module 510 and a data feature fusion module 520. Wherein:
the data feature cleaning module 510 may be configured to obtain a plurality of raw data features and clean each of the raw data features to obtain a plurality of first target data features.
The data feature fusion module 520 may be configured to fuse the first target data features to obtain a plurality of second target data features.
In this exemplary embodiment, the cleaning each of the original data features to obtain a plurality of first target data features includes: and removing abnormal features in the original data features to obtain a plurality of first target data features.
In this exemplary embodiment, removing the abnormal feature values from each of the original data features includes: determining whether the feature value of each original data feature is smaller than a first preset value; when the feature value of an original data feature is determined to be smaller than the first preset value, removing the original data feature corresponding to that feature value; when the feature value of an original data feature is determined to be not smaller than the first preset value, determining whether the feature value is greater than a second preset value; and when the feature value of an original data feature is determined to be greater than the second preset value, removing the original data feature corresponding to that feature value.
In this exemplary embodiment, fusing each of the first target data features to obtain a plurality of second target data features includes: constructing a plurality of characteristic indexes; calculating each feature index of each first target data feature; and fusing the feature indexes of the first target data features to obtain a plurality of second target data features.
In the present exemplary embodiment, the feature indices include a plurality of the following: the mean, the median, the upper quartile, the standard deviation, and the maximum value.
In this example embodiment, the raw data features include a plurality of the following: purchase features, browsing features, add-to-cart features, search features, and coupon features.
The specific details of each module in the data feature construction device have been described in detail in the corresponding data feature construction method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit", a "module", or a "system".
An electronic device 600 according to this embodiment of the invention is described below with reference to fig. 6. The electronic device 600 shown in fig. 6 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 is embodied in the form of a general purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, and a bus 630 that couples the various system components, including the storage unit 620 and the processing unit 610.
The storage unit stores program code that is executable by the processing unit 610 to cause the processing unit 610 to perform the steps according to various exemplary embodiments of the present invention described in the "exemplary methods" section of this specification. For example, the processing unit 610 may perform step S110 as shown in fig. 1: acquiring a plurality of raw data features and cleaning each raw data feature to obtain a plurality of first target data features; and step S120: fusing the first target data features to obtain a plurality of second target data features.
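Taken together, steps S110 and S120 can be sketched as a two-stage pipeline. The sketch below is illustrative only: the function names, the thresholds `low` and `high`, and the two statistics shown in the fusion step are assumptions for demonstration, not choices fixed by the specification.

```python
import numpy as np

def step_s110(raw_features, low, high):
    """S110 sketch: clean each raw data feature by dropping values
    outside the preset interval [low, high]."""
    return [f[(f >= low) & (f <= high)] for f in raw_features]

def step_s120(first_targets):
    """S120 sketch: fuse each cleaned feature into derived features,
    here using only two of the statistics (mean and maximum)."""
    fused = []
    for f in first_targets:
        # Column 0: difference from the mean; column 1: ratio to the maximum.
        fused.append(np.column_stack([f - f.mean(), f / f.max()]))
    return fused
```

A feature vector `[1.0, 2.0, 500.0]` cleaned with bounds 0 and 100 keeps `[1.0, 2.0]`, which the fusion step then turns into a two-column derived-feature array.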
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 600 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 650. Also, the electronic device 600 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above section "exemplary methods" of the present description, when said program product is run on the terminal device.
Referring to fig. 7, a program product 800 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (8)

1. A method of data feature construction, comprising:
acquiring a plurality of raw data features and cleaning each raw data feature to obtain a plurality of first target data features; wherein the raw data features include a plurality of purchase features, browsing features, add-to-cart features, search features, and ticket features;
fusing the first target data features to obtain a plurality of second target data features; wherein the second target data feature is used for training a gradient boosting decision tree;
the fusion of the first target data features to obtain a plurality of second target data features comprises:
constructing a plurality of characteristic indexes; wherein the characteristic indexes comprise a mean value, a median, an upper quartile, a standard deviation, and a maximum value;
calculating the mean value, median, upper quartile, standard deviation, and maximum value of each first target data feature;
calculating, for each first target data feature, a first difference from the mean value, a second difference from the median, a third difference from the upper quartile, and a fourth difference from the standard deviation;
calculating a first ratio between each fourth difference and the mean value, and a second ratio between the feature value of each first target data feature and the maximum value;
and obtaining a plurality of second target data features according to the first difference, the second difference, the third difference, the first ratio, and the second ratio of each first target data feature.
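The fusion steps of claim 1 can be sketched for a single first target data feature (a vector of feature values) as follows. This is a non-authoritative sketch: the reading of "upper quartile" as the 75th percentile, the use of the population standard deviation, and the stacking of the five derived quantities into the second target data features are assumptions.

```python
import numpy as np

def fuse_feature(values):
    """Sketch of claim 1's fusion for one first target data feature:
    compute the five characteristic indexes, then the differences and
    ratios, and return the derived (second target) features."""
    values = np.asarray(values, dtype=float)

    # The five characteristic indexes.
    mean = values.mean()
    median = np.median(values)
    upper_q = np.percentile(values, 75)  # assumed: upper quartile = 75th percentile
    std = values.std()
    maximum = values.max()

    d1 = values - mean      # first difference
    d2 = values - median    # second difference
    d3 = values - upper_q   # third difference
    d4 = values - std       # fourth difference

    r1 = d4 / mean          # first ratio: fourth difference over mean
    r2 = values / maximum   # second ratio: feature value over maximum

    # Second target features: three differences plus two ratios per value.
    return np.column_stack([d1, d2, d3, r1, r2])
```

For the input `[1.0, 2.0, 3.0, 4.0]` the mean is 2.5 and the maximum is 4, so the first derived row starts at -1.5 and the last second-ratio entry is 1.0.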
2. The data feature construction method of claim 1, wherein cleaning each of the raw data features to obtain a plurality of first target data features comprises:
removing abnormal features from the raw data features to obtain a plurality of first target data features.
3. The data feature construction method according to claim 2, wherein removing the abnormal features from each of the raw data features comprises:
determining whether the feature value of each raw data feature is smaller than a first preset value;
when the feature value of a raw data feature is smaller than the first preset value, removing the raw data feature corresponding to that feature value; and
when the feature value of a raw data feature is not smaller than the first preset value, determining whether that feature value is larger than a second preset value;
and when the feature value of a raw data feature is larger than the second preset value, removing the raw data feature corresponding to that feature value.
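The two-threshold test of claim 3 amounts to keeping only values inside a closed interval. A minimal sketch, assuming the first and second preset values act as lower and upper bounds (the names `low` and `high` are illustrative):

```python
def clean_feature_values(values, low, high):
    """Sketch of claim 3's cleaning: drop a value when it is smaller
    than the first preset value (low) or larger than the second preset
    value (high); keep everything in between."""
    return [v for v in values if not (v < low or v > high)]
```

For example, with presets 0 and 100, the input `[-1, 0, 5, 99, 200]` keeps `[0, 5, 99]`.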
4. A data feature construction apparatus, comprising:
the data feature cleaning module is used for acquiring a plurality of raw data features and cleaning each raw data feature to obtain a plurality of first target data features; wherein the raw data features include a plurality of purchase features, browsing features, add-to-cart features, search features, and ticket features;
the data feature fusion module is used for fusing the first target data features to obtain a plurality of second target data features; wherein the second target data feature is used for training a gradient boosting decision tree;
the fusion of the first target data features to obtain a plurality of second target data features comprises:
constructing a plurality of characteristic indexes; wherein the characteristic indexes comprise a mean value, a median, an upper quartile, a standard deviation, and a maximum value;
calculating the mean value, median, upper quartile, standard deviation, and maximum value of each first target data feature;
calculating, for each first target data feature, a first difference from the mean value, a second difference from the median, a third difference from the upper quartile, and a fourth difference from the standard deviation;
calculating a first ratio between each fourth difference and the mean value, and a second ratio between the feature value of each first target data feature and the maximum value;
and obtaining a plurality of second target data features according to the first difference, the second difference, the third difference, the first ratio, and the second ratio of each first target data feature.
5. The data feature construction device of claim 4, wherein the cleaning of each of the raw data features to obtain a plurality of first target data features comprises:
removing abnormal features from the raw data features to obtain a plurality of first target data features.
6. The data feature construction device according to claim 5, wherein removing the abnormal features from each of the raw data features comprises:
determining whether the feature value of each raw data feature is smaller than a first preset value;
when the feature value of a raw data feature is smaller than the first preset value, removing the raw data feature corresponding to that feature value; and
when the feature value of a raw data feature is not smaller than the first preset value, determining whether that feature value is larger than a second preset value;
and when the feature value of a raw data feature is larger than the second preset value, removing the raw data feature corresponding to that feature value.
7. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the data feature construction method of any one of claims 1 to 3.
8. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the data feature construction method of any of claims 1-3 via execution of the executable instructions.
CN201710954269.7A 2017-10-13 2017-10-13 Data feature construction method and device, storage medium and electronic equipment Active CN107644102B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710954269.7A CN107644102B (en) 2017-10-13 2017-10-13 Data feature construction method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN107644102A CN107644102A (en) 2018-01-30
CN107644102B true CN107644102B (en) 2020-11-03

Family

ID=61123521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710954269.7A Active CN107644102B (en) 2017-10-13 2017-10-13 Data feature construction method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN107644102B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919357B (en) * 2019-01-30 2021-01-22 创新先进技术有限公司 Data determination method, device, equipment and medium
CN112199374B (en) * 2020-09-29 2023-12-05 中国平安人寿保险股份有限公司 Data feature mining method for data missing and related equipment thereof
CN113315721B (en) * 2021-05-26 2023-01-17 恒安嘉新(北京)科技股份公司 Network data feature processing method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101702262A (en) * 2009-11-06 2010-05-05 北京交通大学 Data syncretizing method for urban traffic circulation indexes
CN105512687A (en) * 2015-12-15 2016-04-20 北京锐安科技有限公司 Emotion classification model training and textual emotion polarity analysis method and system
CN105930942A (en) * 2016-06-03 2016-09-07 北京理工大学 Intelligent system for predicting energy technologies under big data background
CN106777274A (en) * 2016-06-16 2017-05-31 北京理工大学 A kind of Chinese tour field knowledge mapping construction method and system
CN106846805A (en) * 2017-03-06 2017-06-13 南京多伦科技股份有限公司 A kind of dynamic road grid traffic needing forecasting method and its system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8478711B2 (en) * 2011-02-18 2013-07-02 Larus Technologies Corporation System and method for data fusion with adaptive learning




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant