CN110443305A - Adaptive feature processing method and device

Info

Publication number
CN110443305A
Authority
CN
China
Prior art keywords
characteristic series
characteristic
value
series
type
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910722239.2A
Other languages
Chinese (zh)
Inventor
李倩兰
袁灿
于政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides an adaptive feature processing method and device. The method comprises: dividing data columns into feature columns of different types, where the type of a feature column includes at least one of the following: discrete, continuous, date, and text; performing feature preprocessing on the classified feature columns; and screening the preprocessed feature columns to obtain feature data for model training. By processing features adaptively, the invention lowers the technical threshold of feature processing and improves working efficiency.

Description

Adaptive feature processing method and device
Technical field
The present invention relates to the field of data processing, and in particular to an adaptive feature processing method and device.
Background
As machine learning develops, its range of use keeps widening, and people in different industries want to use machine learning to solve problems in real-world scenarios. To lower the threshold for ordinary people to use machine learning, automated machine learning is the development trend. However, before a machine learning model is trained, the feature data must first be preprocessed. Because different features exhibit different data characteristics, feature preprocessing has to treat different feature columns differently. Analyzing the data comprehensively and carefully is quite time-consuming work. To improve the efficiency of data preprocessing, adaptively preprocessing feature columns with different characteristics is a problem the machine learning industry urgently needs to solve.
Existing feature column preprocessing schemes are usually carried out by personnel with some algorithmic background. The general approach is: 1) inspect the metadata, that is, all information describing the data, including field descriptions, data sources, code tables, and so on; 2) extract part of the data and inspect it manually to gain an intuitive understanding of the data itself, in preparation for later preprocessing; 3) data analysis: use statistical analysis and visual analysis methods to assist feature selection.
Because existing schemes are operated by algorithm professionals who carry out feature processing through manual observation, they require some knowledge of statistics and algorithms and are generally unsuitable for personnel without the relevant background. To use machine learning methods, ordinary personnel must spend a great deal of time learning in advance, which hurts their working efficiency and does not help improve productivity.
Summary of the invention
Embodiments of the present invention provide an adaptive feature processing method and device, to solve at least the problem in the related art that feature column preprocessing must be carried out by professionals.
According to one embodiment of the present invention, an adaptive feature processing method is provided, comprising: dividing data columns into feature columns of different types, where the type of a feature column includes at least one of the following: discrete, continuous, date, and text; performing feature preprocessing on the classified feature columns; and screening the preprocessed feature columns to obtain feature data for model training.
Optionally, before feature preprocessing is performed on the classified feature columns, the method further comprises: calculating the null rate of each feature column and removing feature columns whose null rate is greater than a first predetermined threshold; and filling missing values with the mode for feature columns of the discrete type, with the mean for feature columns of the continuous type, and with the empty string for feature columns of the date or text type.
Optionally, performing feature preprocessing on the classified feature columns includes at least one of the following: performing one-hot encoding or histogram mapping on feature columns of the discrete type; performing binning or normalization on feature columns of the continuous type; processing the dates in feature columns of the date type into discrete or continuous values and generating new feature columns of the discrete or continuous type; and segmenting feature columns of the text type into words to form a word set.
Optionally, screening the preprocessed feature columns to obtain feature data for model training includes at least one of the following: screening the preprocessed feature columns and removing features whose correlation with the label is lower than a second preset threshold; checking the number of distinct values in each feature column and removing single-valued feature columns; calculating the correlation between columns and removing feature columns whose correlation is greater than a third preset threshold; and calculating the importance of the feature columns and removing feature columns whose importance is lower than the third preset threshold.
According to another embodiment of the present invention, an adaptive feature processing device is provided, comprising: a classification module, configured to divide data columns into feature columns of different types, where the type of a feature column includes at least one of the following: discrete, continuous, date, and text; a preprocessing module, configured to apply the corresponding feature preprocessing to feature columns of different types; and a screening module, configured to screen the preprocessed feature columns to obtain feature data for model training.
Optionally, the device further comprises a missing-value filling module, configured to calculate the null rate of each feature column and remove feature columns whose null rate is greater than a first predetermined threshold, and to fill missing values with the mode for feature columns of the discrete type, with the mean for feature columns of the continuous type, and with the empty string for feature columns of the date or text type.
Optionally, the preprocessing module includes at least one of the following: a first preprocessing unit, configured to perform one-hot encoding or histogram mapping on feature columns of the discrete type; a second preprocessing unit, configured to perform binning or normalization on feature columns of the continuous type; a third preprocessing unit, configured to process the dates in feature columns of the date type into discrete or continuous values and generate new feature columns of the discrete or continuous type; and a fourth preprocessing unit, configured to segment feature columns of the text type into words to form a word set.
Optionally, the screening module includes at least one of the following: a first screening unit, configured to screen the preprocessed feature columns and remove features whose correlation with the label is lower than a second preset threshold; a second screening unit, configured to check the number of distinct values in each feature column and remove single-valued feature columns; a third screening unit, configured to calculate the correlation between columns and remove feature columns whose correlation is greater than a third preset threshold; and a fourth screening unit, configured to calculate the importance of the feature columns and remove feature columns whose importance is lower than the third preset threshold.
According to yet another embodiment of the present invention, a storage medium is also provided. A computer program is stored in the storage medium, and the computer program is configured to execute, when run, the steps of any of the above method embodiments.
According to yet another embodiment of the present invention, an electronic device is also provided, comprising a memory and a processor. A computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps of any of the above method embodiments.
In the above embodiments of the present invention, the data are divided, according to their different characteristics, into feature columns of the discrete, continuous, date, and text types; a different feature preprocessing strategy is applied to each type of feature column; and the features are filtered by a feature selection strategy to obtain the final feature data used for model training. Feature processing can therefore be carried out automatically once the feature column types and a few configuration parameters are supplied.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the adaptive feature processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of adaptive feature processing according to an embodiment of the present invention;
Fig. 3 is a flowchart of feature processing for insurance data according to an embodiment of the present invention;
Fig. 4 is a structural schematic diagram of the adaptive feature processing device according to an embodiment of the present invention;
Fig. 5 is a structural schematic diagram of the adaptive feature processing device according to an alternative embodiment of the present invention.
Detailed description
The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with each other.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects, and are not used to describe a particular order or sequence.
This embodiment provides an adaptive feature processing method. Fig. 1 is a flowchart of the method according to an embodiment of the present invention. As shown in Fig. 1, the process includes the following steps:
Step S102: divide the data columns into feature columns of different types, where the type of a feature column includes at least one of the following: discrete, continuous, date, and text;
Step S104: perform feature preprocessing on the classified feature columns;
Step S106: screen the preprocessed feature columns to obtain feature data for model training.
In step S102 of this embodiment, the data columns may be divided into feature columns of the discrete, continuous, date, and text types according to the category characteristics of the data, so that feature preprocessing can be performed on the data columns.
Before step S104 of this embodiment, the method may further include: calculating the null rate of each feature column and removing feature columns whose null rate is greater than a first predetermined threshold; and filling missing values with the mode for feature columns of the discrete type, with the mean for feature columns of the continuous type, and with the empty string for feature columns of the date or text type.
In step S104 of this embodiment, one-hot encoding or histogram mapping is performed on feature columns of the discrete type; binning or normalization is performed on feature columns of the continuous type; the dates in feature columns of the date type are processed into discrete or continuous values, generating new feature columns of the discrete or continuous type; and feature columns of the text type are segmented into words to form a word set.
In step S106 of this embodiment, the preprocessed feature columns are screened and features whose correlation with the label is lower than a second preset threshold are removed; the number of distinct values in each feature column is checked and single-valued feature columns are removed; the correlation between columns is calculated and feature columns whose correlation is greater than a third preset threshold are removed; and the importance of the feature columns is calculated and feature columns whose importance is lower than the third preset threshold are removed.
In the above embodiment, the data are divided, according to their different characteristics, into feature columns of the discrete, continuous, date, and text types; a different feature preprocessing strategy is applied to each type of feature column; and the features are filtered by a feature selection strategy to obtain the final feature data used for model training. Feature processing can therefore be carried out automatically once the feature column types and a few configuration parameters are supplied.
To make the technical solution provided by the embodiments of the present invention easier to understand, a concrete application embodiment is described below.
First, some prior-art terms used in this embodiment are explained as follows:
One-hot encoding: N discrete values are represented by N-bit feature values. For example, the feature column "gender" has two attribute values, "male" and "female"; after encoding, "male" is "01" and "female" is "10".
Min-max normalization: a normalization method that maps all values into the range [0, 1], where MinValue is the minimum of all values and MaxValue is the maximum:
y = (x - MinValue) / (MaxValue - MinValue)
Z-score standardization: the formula is as follows, where x is the original value, μ is the mean, and σ is the standard deviation: y = (x - μ) / σ.
Log standardization: a standardization used when the values differ greatly in order of magnitude, where lg is the base-10 logarithm:
y = lg(x) / lg(MaxValue)
Pearson coefficient: a method of calculating the correlation coefficient. The formula is as follows, where X and Y are two feature columns, σ_X is the standard deviation of X, σ_Y is the standard deviation of Y, and μ_X and μ_Y are the expectations of X and Y respectively:
ρ(X, Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X σ_Y)
Gini coefficient: the parameter used in decision tree binning; it measures the degree of uncertainty. The formula is:
Gini(p) = 1 - Σ_k (p_k)²
where p_k is the proportion of the k-th class label.
Chi-square value: the parameter used in chi-square binning; the chi-square value is used to test the correlation of the data and is calculated as follows:
χ² = Σ_i Σ_j (A_ij - E_ij)² / E_ij
where A_ij is the number of samples with the j-th class label in the i-th interval, and E_ij is the expectation of A_ij: E_ij = N_i × p_j, where N is the total number of samples, N_i is the number of samples in the i-th interval, and p_j is the proportion of the j-th class label among all samples.
In this embodiment, the data columns are first divided into four categories, discrete, continuous, date, and text, according to the category characteristics of the data columns, and feature preprocessing is performed on them. Then, according to the feature selection strategy, the preprocessed features are screened to remove features with low correlation to the label.
As shown in Fig. 2, the steps are carried out in the following order:
1) Classify all feature columns: according to the characteristics of the data, the columns can be divided into four types, discrete, continuous, date, and text; the classification can be obtained by manual analysis or by rule-based analysis.
Then, feature processing is performed on the classified feature columns; it mainly includes filling missing values in the feature columns, feature preprocessing, and feature selection.
2) Fill missing values in the feature columns: before filling, first calculate the null rate of each column and remove feature columns whose null rate is greater than a threshold k; k can be set freely and may default to 95%. Then fill the missing values in the remaining feature columns as follows:
2.1) for feature columns of the discrete category, fill missing values with the mode;
2.2) for feature columns of the continuous category, fill missing values with the mean;
2.3) for feature columns of the date and text categories, fill missing values with the empty string.
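Step 2) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `fill_missing`, the dictionary-of-lists layout, and the use of `None` to mark a missing value are all assumptions made for the example.

```python
from collections import Counter
from statistics import mean


def fill_missing(columns, types, k=0.95):
    """Drop columns whose null rate exceeds k, then fill the rest by type.

    columns: dict mapping column name -> list of values (None = missing).
    types:   dict mapping column name -> 'discrete', 'continuous',
             'date' or 'text' (hypothetical layout for this sketch).
    """
    result = {}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        null_rate = 1 - len(present) / len(values)
        if null_rate > k or not present:
            continue  # column is (almost) entirely empty: remove it
        if types[name] == "discrete":
            fill = Counter(present).most_common(1)[0][0]  # mode
        elif types[name] == "continuous":
            fill = mean(present)  # mean of the observed values
        else:
            fill = ""  # date and text columns get the empty string
        result[name] = [fill if v is None else v for v in values]
    return result
```

Called with `k=0.95`, a column that is entirely `None` is dropped, while a discrete column such as `["m", None, "m", "f"]` becomes `["m", "m", "m", "f"]`.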
3) Feature preprocessing:
3.1) Feature columns of the date category are processed as follows:
3.1.1) Find the earliest and the latest date in the column, and calculate the number of days between each date and the earliest date as a new feature column;
3.1.2) Assign the new feature column from the previous step to the continuous category;
3.1.3) Determine whether each date falls on a weekend, and add a feature column indicating this; the field type is an int, 1 for a weekend and 0 otherwise;
3.1.4) For dates that include a time, classify the time into five categories, represented by 1 to 5 respectively: 5:00-11:00 is morning, 11:00-14:00 is noon, 14:00-18:00 is afternoon, 18:00-23:00 is evening, and 23:00-5:00 is late night;
3.1.5) Assign the classified time column from the previous step to the discrete category;
3.1.6) Delete the original date column.
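The date handling in 3.1) can be illustrated with the standard `datetime` module. The sketch below assumes the dates are already parsed into `datetime` objects and treats each time range as half-open (for example, 11:00 itself counts as noon); the source does not say which side a boundary falls on, so that choice, like the names `date_features` and `time_period`, is an assumption.

```python
from datetime import datetime


def time_period(ts):
    """Map the hour of day to the five categories 1-5 listed in 3.1.4."""
    h = ts.hour
    if 5 <= h < 11:
        return 1  # morning
    if 11 <= h < 14:
        return 2  # noon
    if 14 <= h < 18:
        return 3  # afternoon
    if 18 <= h < 23:
        return 4  # evening
    return 5      # late night, 23:00-5:00


def date_features(timestamps):
    """Derive the three new columns that replace a date column."""
    earliest = min(timestamps)
    return {
        # continuous: whole days elapsed since the earliest date
        "days": [(ts.date() - earliest.date()).days for ts in timestamps],
        # discrete: 1 if Saturday or Sunday, otherwise 0
        "weekend": [1 if ts.weekday() >= 5 else 0 for ts in timestamps],
        # discrete: time-of-day category
        "period": [time_period(ts) for ts in timestamps],
    }
```

The original date column is simply not carried over into the returned dictionary, which corresponds to step 3.1.6.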
3.2) Feature columns of the text category are processed as follows:
3.2.1) Segment all sentences into words, remove stop words, and list all words to form the word set;
3.2.2) Calculate the TF-IDF of each word, where:
TF (term frequency) = number of times the word occurs in the text / total number of words in the text,
IDF (inverse document frequency) = log(total number of texts / (number of texts containing the word + 1)),
TF-IDF = TF * IDF.
3.2.3) Calculate the keywords of each text, choose the same number of keywords from each text, and merge them into one set;
3.2.4) For each text, calculate the term frequency of every word in this set to generate the text's term frequency vector.
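Steps 3.2.2 to 3.2.4 can be sketched as follows. The example assumes the texts have already been segmented and stripped of stop words (step 3.2.1), uses the natural logarithm for IDF, and breaks ties between equally scored words alphabetically; none of these details is fixed by the source, and the function name `tfidf_keyword_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_keyword_vectors(docs, top_n=2):
    """docs: list of non-empty token lists. Returns (vocabulary, vectors)."""
    n_docs = len(docs)
    # number of texts that contain each word
    doc_freq = Counter(word for doc in docs for word in set(doc))
    keywords = set()
    for doc in docs:
        counts = Counter(doc)
        # TF-IDF = (count / text length) * log(texts / (texts with word + 1))
        scores = {
            w: (c / len(doc)) * math.log(n_docs / (doc_freq[w] + 1))
            for w, c in counts.items()
        }
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.update(ranked[:top_n])  # same number of keywords per text
    vocab = sorted(keywords)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # term frequency of each keyword in this text
        vectors.append([counts[w] / len(doc) for w in vocab])
    return vocab, vectors
```

Note that with this IDF definition a word that occurs in every text gets a negative score, so it is never chosen as a keyword ahead of a rarer word.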
3.3) Feature columns of the discrete category are processed, for example by one-hot encoding or histogram mapping.
Histogram mapping works as follows: for each attribute value in the column, calculate the proportion of each label. For example, the feature column "gender" has two attribute values, "male" and "female", and the label column has three labels, 0, 1, and 2. If, among the samples whose attribute value is "male", label 0 accounts for 1/2, label 1 for 1/3, and label 2 for 1/6, then the processed feature of every sample whose attribute value is "male" is [1/2, 1/3, 1/6].
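Both treatments of a discrete column fit in a short sketch. The function names `one_hot` and `histogram_mapping` are hypothetical, and ordering the categories and labels by sorting is an assumption chosen so that the "gender" example reproduces the codes given earlier ("male" as 01, "female" as 10).

```python
from collections import Counter, defaultdict


def one_hot(values):
    """Encode each value as an N-bit list, one bit per distinct value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values]


def histogram_mapping(values, labels):
    """Replace each value by the label distribution observed for it."""
    classes = sorted(set(labels))
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    table = {
        v: [c[y] / sum(c.values()) for y in classes]
        for v, c in counts.items()
    }
    return [table[v] for v in values]
```

For six "male" samples with labels 0, 0, 0, 1, 1, 2, every "male" row maps to [1/2, 1/3, 1/6], matching the example in the text.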
3.4) Feature columns of the continuous category are processed, for example by binning or normalization.
Binning is described as follows:
The bin intervals do not have to be set manually; they are set automatically according to the data distribution. In this embodiment, binning is divided into chi-square binning and decision tree binning.
The process of chi-square binning is:
Step 1: set a chi-square threshold or a threshold on the number of bins;
Step 2: sort the values of the column, with each value forming its own interval;
Step 3: calculate the chi-square value between adjacent intervals and merge the pair of intervals with the smallest chi-square value;
Step 4: repeat step 3 until the chi-square values of all intervals are greater than the set threshold, or the number of intervals is less than the set threshold;
Step 5: bin the column using the bin intervals obtained in step 4; if the number of bins is n, the categories after binning are 1 to n.
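The five steps above correspond closely to the ChiMerge procedure, and a compact sketch is given below. It computes the expected counts over the two adjacent intervals being compared, as in standard ChiMerge, which is an assumption about how the chi-square definition is applied here. It also stops as soon as the interval count reaches `max_bins` rather than dropping below it. The function names and the `max_bins` and `threshold` parameters are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def chi2_pair(a, b, classes):
    """Chi-square value of two adjacent intervals (label Counters a, b)."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for interval in (a, b):
        n_i = sum(interval.values())
        for c in classes:
            expected = n_i * (a[c] + b[c]) / total  # E_ij = N_i * p_j
            if expected > 0:
                chi2 += (interval[c] - expected) ** 2 / expected
    return chi2


def chi_merge(values, labels, max_bins=None, threshold=None):
    """Return the sorted lower bounds of the merged intervals."""
    classes = sorted(set(labels))
    grouped = {}
    for v, y in zip(values, labels):
        grouped.setdefault(v, Counter())[y] += 1
    # step 2: one interval per distinct value, stored as [lower bound, counts]
    intervals = [[v, grouped[v]] for v in sorted(grouped)]
    while len(intervals) > 1:
        scores = [chi2_pair(intervals[i][1], intervals[i + 1][1], classes)
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        if max_bins is not None and len(intervals) <= max_bins:
            break  # few enough intervals
        if threshold is not None and scores[i] > threshold:
            break  # every adjacent pair already differs enough
        # step 3: merge the most similar adjacent pair
        intervals[i][1] = intervals[i][1] + intervals[i + 1][1]
        del intervals[i + 1]
    return [lower for lower, _ in intervals]


def assign_bins(values, bounds):
    """Step 5: map each value to a bin number between 1 and len(bounds)."""
    return [max(1, bisect_right(bounds, v)) for v in values]
```

At least one of `max_bins` and `threshold` should be given; with neither, the loop merges everything into a single interval.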
Decision tree binning uses CART decision tree binning, and the process is:
Step 1: set the minimum sample size after a split;
Step 2: sort all values; between every two adjacent values x and y there is a candidate split point, with the split boundary set to (x + y) / 2;
Step 3: for each split point, calculate the Gini coefficient after the split, and take the split point with the smallest Gini coefficient;
Step 4: when the number of samples in the current branch after the split is less than the minimum sample size set in step 1, stop splitting; otherwise, repeat step 3;
Step 5: bin the feature column using the bin intervals obtained in step 4.
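A recursive sketch of this CART-style binning is shown below. The sample-size rule is implemented by only considering splits that leave at least `min_samples` samples on each side, and the sketch additionally stops at pure nodes (Gini coefficient 0); the latter is an assumption added to avoid pointless splits and is not stated in the source. The names `gini` and `tree_bins` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient 1 - sum(p_k^2) of a list of labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def tree_bins(values, labels, min_samples=2):
    """Return the sorted cut points found by recursive Gini splitting."""
    cuts = []

    def grow(pairs):
        if gini([y for _, y in pairs]) == 0.0:
            return  # pure node: nothing to gain from splitting (assumption)
        best = None
        # only try splits that keep min_samples on both sides
        for i in range(min_samples, len(pairs) - min_samples + 1):
            x, y = pairs[i - 1][0], pairs[i][0]
            if x == y:
                continue  # no boundary between equal values
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
            if best is None or score < best[0]:
                best = (score, i, (x + y) / 2)
        if best is None:
            return  # branch too small to split further
        _, i, cut = best
        cuts.append(cut)
        grow(pairs[:i])
        grow(pairs[i:])

    grow(sorted(zip(values, labels), key=lambda p: p[0]))
    return sorted(cuts)
```

The returned cut points define the bin intervals; values can then be assigned to bins by comparing them against the cuts in order.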
Normalization comes in three forms: min-max normalization, z-score standardization, and log standardization. The process is: calculate the ratio of the column's maximum to its minimum, max/min; if it is greater than 10^4, use log standardization; otherwise, normalize using the preset method.
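The selection rule can be sketched together with the three formulas defined earlier. The sketch assumes a non-constant column, uses the population standard deviation for the z-score, and only applies the max/min test when the minimum is positive, since both the ratio and the logarithm require it; these guards and the function names are assumptions, not part of the source.

```python
import math
from statistics import mean, pstdev


def min_max(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def z_score(xs):
    mu, sigma = mean(xs), pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]


def log_scale(xs):
    top = math.log10(max(xs))
    return [math.log10(x) / top for x in xs]


def normalize(xs, method="min_max", ratio_limit=1e4):
    """Use log standardization when max/min exceeds ratio_limit,
    otherwise fall back to the preset method."""
    if min(xs) > 0 and max(xs) / min(xs) > ratio_limit:
        return log_scale(xs)
    return {"min_max": min_max, "z_score": z_score}[method](xs)
```

For example, `normalize([1.0, 2.0, 3.0])` uses min-max scaling, while a column spanning 1 to 1,000,000 exceeds the ratio limit and is log-scaled.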
4) Feature selection: the preprocessed features are screened as follows:
4.1) check the number of distinct values in each column and remove single-valued feature columns;
4.2) calculate the correlation between columns and remove feature columns whose correlation is greater than a threshold m, where correlation is measured with the Pearson coefficient;
4.3) calculate the importance of the feature columns and remove feature columns whose importance is lower than a threshold, as follows:
Step 1: set the number of features that feature selection should retain;
Step 2: train a random forest model on the data columns to obtain the feature importances;
Step 3: retain the features with the greatest importance, up to the number set in step 1.
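The three screening steps can be sketched as one function. Several details here are assumptions rather than the patent's method: the correlation is compared by absolute value, the later column of a highly correlated pair is the one removed, and the importance scores are passed in as a precomputed dictionary instead of being obtained by training a random forest. All names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def select_features(columns, importance, m=0.9, keep=2):
    """columns: dict name -> numeric list; importance: dict name -> score."""
    # 4.1: drop columns that hold a single distinct value
    names = [n for n, v in columns.items() if len(set(v)) > 1]
    # 4.2: keep a column only if it is not highly correlated
    #      with any column that was kept before it
    kept = []
    for name in names:
        if all(abs(pearson(columns[name], columns[k])) <= m for k in kept):
            kept.append(name)
    # 4.3: retain the `keep` most important remaining columns
    ranked = sorted(kept, key=lambda n: importance[n], reverse=True)
    return ranked[:keep]
```

Removing single-valued columns first also protects `pearson` from a zero standard deviation in the denominator.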
A specific embodiment using insurance data as an example is as follows:
In this embodiment, the data were obtained from an insurance company, about 30 million records in total; sample data are shown in Table 1:
Table 1
As shown in Fig. 3, this embodiment mainly includes the following steps:
S301: classify the feature columns. The classification results are as follows:
S302: process "last insured date". Find the earliest date and subtract it from every date to obtain the interval in days, saved as a new column "last insured date_days", which is continuous. Determine whether each date falls on a weekend, saved as a new column "last insured date_weekend", which is discrete. Determine whether the time is morning/afternoon/evening/late night, saved as a new column "last insured date_period", which is discrete.
S303: process the "address" column, specifically:
Process the continuous values; the continuous columns include: total premium, last insured date_days.
Process the discrete values; the discrete columns include: time, insurance channel, last insured date_weekend, last insured date_period.
Step S304: perform feature selection on the preprocessed features.
Step S305: obtain the adaptively processed features and input them into a machine learning model for training.
The above embodiment of the present invention provides a complete method of adaptive feature processing that effectively lowers the technical threshold of feature processing and improves working efficiency.
From the description of the above embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the method described in each embodiment of the present invention.
Additionally provide a kind of self-adaptive features processing unit in the present embodiment, the device for realizing above-described embodiment and Preferred embodiment, the descriptions that have already been made will not be repeated.As used below, term " module " and " unit " can be real The combination of the software and/or hardware of existing predetermined function.Although device described in following embodiment is preferably realized with software, But the realization of the combination of hardware or software and hardware is also that may and be contemplated.
Fig. 4 is the structural block diagram of self-adaptive features processing unit according to an embodiment of the present invention, as shown in Fig. 2, the device Including categorization module 10, preprocessing module 20 and screening module 30.
Categorization module 10, for data column to be divided into different types of characteristic series, wherein the type of the characteristic series is at least Including following one: discrete, continuous, date and text.
Preprocessing module 20, for being pre-processed to different types of characteristic series using corresponding feature.
Screening module 30, for carrying out the characteristic that screening acquisition is used for model training to pretreated characteristic series.
Fig. 5 is the structural block diagram of the self-adaptive features processing unit of alternative embodiment according to the present invention, as shown in figure 5, should Device further includes filling a vacancy to be worth module 40 in addition to including categorization module 10 shown in Fig. 4, preprocessing module 20 and screening module 30.
It fills a vacancy and is worth module 40, for calculating the null value rate of each characteristic series, and weed out null value rate greater than the first predetermined threshold Characteristic series;It is filled a vacancy value using mode the characteristic series of discrete type, uses mean value to fill a vacancy value the characteristic series of continuous type, it is right The characteristic series of date or text type are filled a vacancy value using null character string.
In the present embodiment, the preprocessing module 20 can also include at least one of: the first pretreatment unit 201, one-hot coding or Histogram Mapping are carried out for the characteristic series to discrete type;Second pretreatment unit 202, for even The characteristic series of continuous type carry out branch mailbox or normalized;Third pretreatment unit 203, in the characteristic series by date type Date be processed into discrete value or successive value, and generate the characteristic series of new discrete type or the characteristic series of continuous type;4th Pretreatment unit 204 is segmented for the characteristic series to text type to constitute word set.
In the present embodiment, the screening module 30 can also include at least following one: the first screening unit 301, use It is screened in the pretreated characteristic series of feature, removes the feature for being lower than the second preset threshold with label correlation;Second Screening unit 302 weeds out the characteristic series of single value for checking the non-duplicate value number of each characteristic series;Third filtering unit 303, for calculating correlation between the column and the column, and weed out the characteristic series that correlation is greater than third predetermined threshold value;4th screening Unit 304 for calculating the importance of characteristic series, and weeds out the characteristic series that importance is lower than third predetermined threshold value.
It should be noted that each of the above modules may be implemented in software or in hardware. In the latter case, this may be achieved in the following ways, among others: the modules are all located in the same processor; or the modules are located in different processors in any combination.
An embodiment of the present invention further provides a storage medium storing a computer program, wherein the computer program is configured to perform, when run, the steps of any of the above method embodiments.
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing a computer program.
An embodiment of the present invention further provides an electronic device comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Those skilled in the art will understand that the modules or steps of the invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from the one given here, or the modules or steps may be fabricated as individual integrated circuit modules, or several of them may be fabricated as a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may be subject to various modifications and changes. Any modification, equivalent replacement or improvement made within the principle of the invention shall fall within its scope of protection.

Claims (10)

1. A self-adaptive feature processing method, characterized by comprising:
dividing data columns into feature columns of different types, wherein the types of the feature columns include at least one of the following: discrete, continuous, date and text;
performing feature preprocessing on the classified feature columns;
screening the preprocessed feature columns to obtain the feature data used for model training.
2. The method according to claim 1, characterized in that, before the feature preprocessing is performed on the classified feature columns, the method further comprises:
calculating the null rate of each feature column, and removing feature columns whose null rate is greater than a first predetermined threshold;
filling missing values in discrete feature columns with the mode, filling missing values in continuous feature columns with the mean, and filling missing values in date or text feature columns with an empty string.
3. The method according to claim 1, characterized in that performing feature preprocessing on the classified feature columns includes at least one of the following:
performing one-hot encoding or histogram mapping on discrete feature columns;
performing binning or normalization on continuous feature columns;
converting the dates in date feature columns into discrete or continuous values, and generating new discrete or continuous feature columns;
segmenting text feature columns into words to form a word set.
4. The method according to claim 1, characterized in that screening the preprocessed feature columns to obtain the feature data used for model training includes at least one of the following:
screening the preprocessed feature columns and removing features whose correlation with the label is lower than a second preset threshold;
checking the number of distinct values in each feature column and removing single-valued feature columns;
calculating the correlation between columns and removing feature columns whose correlation is greater than a third preset threshold;
calculating the importance of the feature columns and removing feature columns whose importance is lower than the third preset threshold.
5. A self-adaptive feature processing device, characterized by comprising:
a classification module, configured to divide data columns into feature columns of different types, wherein the types of the feature columns include at least one of the following: discrete, continuous, date and text;
a preprocessing module, configured to apply the corresponding feature preprocessing to the different types of feature columns;
a screening module, configured to screen the preprocessed feature columns to obtain the feature data used for model training.
6. The device according to claim 5, characterized by further comprising:
a missing-value filling module, configured to calculate the null rate of each feature column and remove feature columns whose null rate is greater than a first predetermined threshold; and to fill missing values in discrete feature columns with the mode, in continuous feature columns with the mean, and in date or text feature columns with an empty string.
7. The device according to claim 5, characterized in that the preprocessing module includes at least one of the following:
a first preprocessing unit, configured to perform one-hot encoding or histogram mapping on discrete feature columns;
a second preprocessing unit, configured to perform binning or normalization on continuous feature columns;
a third preprocessing unit, configured to convert the dates in date feature columns into discrete or continuous values, and to generate new discrete or continuous feature columns;
a fourth preprocessing unit, configured to segment text feature columns into words to form a word set.
8. The device according to claim 5, characterized in that the screening module includes at least one of the following:
a first screening unit, configured to screen the preprocessed feature columns and remove features whose correlation with the label is lower than a second preset threshold;
a second screening unit, configured to check the number of distinct values in each feature column and remove single-valued feature columns;
a third screening unit, configured to calculate the correlation between columns and remove feature columns whose correlation is greater than a third preset threshold;
a fourth screening unit, configured to calculate the importance of the feature columns and remove feature columns whose importance is lower than the third preset threshold.
9. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is configured to perform, when run, the method according to any one of claims 1 to 4.
10. An electronic device comprising a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is configured to run the computer program to perform the method according to any one of claims 1 to 4.
CN201910722239.2A 2019-08-06 2019-08-06 Self-adaptive features processing method and processing device Pending CN110443305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910722239.2A CN110443305A (en) 2019-08-06 2019-08-06 Self-adaptive features processing method and processing device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910722239.2A CN110443305A (en) 2019-08-06 2019-08-06 Self-adaptive features processing method and processing device

Publications (1)

Publication Number Publication Date
CN110443305A true CN110443305A (en) 2019-11-12

Family

ID=68433578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910722239.2A Pending CN110443305A (en) 2019-08-06 2019-08-06 Self-adaptive features processing method and processing device

Country Status (1)

Country Link
CN (1) CN110443305A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732934A (en) * 2021-01-11 2021-04-30 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN105786860A (en) * 2014-12-23 2016-07-20 华为技术有限公司 Data processing method and device in data modeling
CN108616373A (en) * 2016-12-12 2018-10-02 中国科学院深圳先进技术研究院 Frequency spectrum entropy prediction technique and system
CN108897834A (en) * 2018-06-22 2018-11-27 招商信诺人寿保险有限公司 Data processing and method for digging
CN109739844A (en) * 2018-12-26 2019-05-10 西安电子科技大学 Data classification method based on decaying weight
CN109799269A (en) * 2019-01-24 2019-05-24 山东工商学院 Electronic nose gas sensor array optimization method based on behavioral characteristics different degree
CN109800790A (en) * 2018-12-24 2019-05-24 厦门大学 A kind of feature selection approach towards high dimensional data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112732934A (en) * 2021-01-11 2021-04-30 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method
CN112732934B (en) * 2021-01-11 2022-05-27 国网山东省电力公司电力科学研究院 Power grid equipment word segmentation dictionary and fault case library construction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191112