CN110334720A - Feature extracting method, device, server and the storage medium of business datum - Google Patents

Feature extraction method, apparatus, server and storage medium for business data

Info

Publication number
CN110334720A
Authority
CN
China
Prior art keywords
rule
business datum
target
feature
dimensionality reduction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810289688.8A
Other languages
Chinese (zh)
Inventor
刘昊骋
丁磊
徐西孟
宫健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810289688.8A priority Critical patent/CN110334720A/en
Publication of CN110334720A publication Critical patent/CN110334720A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

Embodiments of the present invention disclose a feature extraction method, apparatus, server and storage medium for business data. The method comprises: determining at least one of a target encoding rule, a target normalization rule and a target dimensionality reduction rule for the business data, wherein the target encoding rule is determined from candidate encoding rules provided in advance, the target normalization rule is determined from candidate feature normalization rules provided in advance, and the target dimensionality reduction rule is determined from candidate dimensionality reduction rules provided in advance; and determining a feature vector of the business data according to the at least one of the target encoding rule, the target normalization rule and the target dimensionality reduction rule. In the embodiments, the feature vector corresponding to the business data can be generated automatically by modifying the configuration parameters of the feature engineering. This makes feature engineering modular, automated and reusable, and improves the efficiency and accuracy of feature vector generation.

Description

Feature extraction method, apparatus, server and storage medium for business data
Technical field
Embodiments of the present invention relate to the field of machine learning, and in particular to a feature extraction method, apparatus, server and storage medium for business data.
Background
With the continuing development of computer technology and big data applications, more and more technical fields build machine learning models on big data to imitate human patterns of thought, so that electronic products can offer a more user-friendly experience.
The prerequisite for machine learning modeling is to process the business data into a simplified feature vector that fully represents the characteristics of the business data. Building the machine learning model on feature vectors improves the efficiency and accuracy of model construction. Existing machine learning modeling platforms provide graphical operation interfaces that are convenient for developers. Although developers no longer have to write large amounts of program code to process business data, during the feature engineering that produces the feature vector (field feature extraction, feature encoding, dimensionality reduction and similar operations on the business data) they still have to process the data by hand, one item at a time, according to their understanding of the business. Feature encoding, normalization and dimensionality reduction are therefore still manual work.
However, manual feature engineering is severely limited. Business data with few feature dimensions can still be handled by hand, but once the number of feature dimensions grows, manual feature engineering consumes a great deal of labor and time, and the user has to try the various feature engineering methods repeatedly to optimize the model. Imbalanced or abnormal sample data can also degrade the modeling result. Developers therefore spend a great deal of time on repetitive feature engineering and sample analysis, so bringing a model online takes a long time and business requirements and model iteration cannot be met quickly.
Summary of the invention
Embodiments of the present invention provide a feature extraction method, apparatus, server and storage medium for business data that make feature engineering modular, automated and reusable and improve the efficiency and accuracy of feature vector generation.
In a first aspect, an embodiment of the present invention provides a feature extraction method for business data, comprising:
determining at least one of a target encoding rule, a target normalization rule and a target dimensionality reduction rule for the business data; and
determining a feature vector of the business data according to the at least one of the target encoding rule, the target normalization rule and the target dimensionality reduction rule of the business data.
In a second aspect, an embodiment of the present invention provides a feature extraction apparatus for business data, comprising:
a rule configuration module, configured to determine at least one of a target encoding rule, a target normalization rule and a target dimensionality reduction rule for the business data; and
a feature generation module, configured to determine a feature vector of the business data according to the at least one of the target encoding rule, the target normalization rule and the target dimensionality reduction rule of the business data.
In a third aspect, an embodiment of the present invention provides a server, comprising:
one or more processors; and
a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the feature extraction method for business data according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the feature extraction method for business data according to any embodiment of the present invention.
In the embodiments of the present invention, the configuration parameters of the feature engineering are modified according to the characteristics of the business data so as to determine at least one of the target encoding rule, the target normalization rule and the target dimensionality reduction rule, and the feature vector of the business data is generated according to the determined rules. Developers only need to modify the configuration parameters of the feature engineering, either from a business perspective or starting from the optimal configuration generated by the system, and associate the business data with the feature engineering; the corresponding feature vector is then generated automatically. This makes feature engineering modular, automated and reusable, and improves the efficiency and accuracy of feature vector generation.
Brief description of the drawings
Fig. 1 is a flowchart of a feature extraction method for business data according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a feature extraction method for business data according to Embodiment 2 of the present invention;
Fig. 3 is an example diagram of the rules that can be configured at each feature processing stage according to Embodiment 2 of the present invention;
Fig. 4 is a flowchart of the model training procedure based on the feature engineering automation platform according to Embodiment 2 of the present invention;
Fig. 5 is a schematic structural diagram of a feature extraction apparatus for business data according to Embodiment 3 of the present invention;
Fig. 6 is a schematic structural diagram of a server according to Embodiment 4 of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the embodiments of the invention rather than the entire structure.
Embodiment 1
Fig. 1 is a flowchart of a feature extraction method for business data according to Embodiment 1 of the present invention. This embodiment is applicable where feature engineering is performed on business data to generate a feature vector, and the method can be executed by a feature extraction apparatus for business data. The method comprises the following steps:
S110: Determine at least one of a target encoding rule, a target normalization rule and a target dimensionality reduction rule for the business data.
In this embodiment, business data refers to the data to be analyzed for building a machine learning model, for example the registered-user information of a website, including each user's age, occupation and income. The required business data can be obtained in various ways, such as web crawling or database access.
The target encoding rule, target normalization rule and target dimensionality reduction rule are data processing rules configured specifically for the business data. The target encoding rule defines the feature encoding method for the business data; it converts the attribute features of the raw business data into numerical identifiers that a computer can recognize. Examples of feature encoding methods include One-Hot encoding (encoding with a single active bit), numerical mapping and interval mapping. The target normalization rule defines the feature normalization method for the business data; it scales the feature data proportionally so that it falls into a small, specific interval, removing the unit of the data and converting it into dimensionless pure values, so that indicators with different units or magnitudes can be compared and weighted. The most typical normalization maps the data uniformly onto the interval [0, 1]. The target dimensionality reduction rule defines the feature dimensionality reduction method for the business data; it recombines or deletes original features to reduce the feature dimension and to limit the adverse effect that excessive or redundant feature dimensions have on the machine learning model. For example, principal component analysis (PCA) converts many indicators into a few composite indicators.
In one embodiment, the target encoding rule may be determined from candidate encoding rules provided in advance, the target normalization rule may be determined from candidate feature normalization rules provided in advance, and likewise the target dimensionality reduction rule may be determined from candidate dimensionality reduction rules provided in advance. The candidate encoding rules, candidate feature normalization rules and candidate dimensionality reduction rules may be packaged in advance, based on the business domain and/or business scenario to which the business data belongs, for developers to choose from.
It is worth noting that the system sets a default configuration for the feature engineering based on historical modeling experience. When the feature engineering rules are configured for given business data, at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule of the business data must be determined so that the feature engineering can process the data according to the configured rules.
For example, based on historical modeling experience the system sets the following default configuration. The age feature uses interval-mapping encoding, which converts each age interval into a corresponding scalar value: the value 1 denotes the age interval [0, 18), 2 denotes [18, 30), 3 denotes [30, 40), and so on. The occupation feature uses One-Hot encoding: if there are three occupations [teacher, doctor, police officer], then after One-Hot encoding [1, 0, 0] denotes teacher, [0, 1, 0] denotes doctor and [0, 0, 1] denotes police officer. The income and deposit features are normalized with the Altman Z-score model. When modeling, developers can modify the configuration to suit their own business, and the system can also automatically "try" other configurations. For example, the age interval mapping can be changed so that 1 denotes [0, 25), 2 denotes [25, 45), 3 denotes [45, 65), and so on; the occupation feature can be changed to Weight of Evidence (WOE) encoding; and the income and deposit features can be normalized with the Min-Max (minimum-maximum) method.
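The change from the default age intervals to the modified ones can be expressed purely as a change of configuration. The following is a minimal sketch, not the patent's implementation: the names `DEFAULT_AGE_BOUNDS`, `MODIFIED_AGE_BOUNDS` and `interval_code` are hypothetical, and ages beyond the last listed bound simply receive the next code, following the "and so on" in the text.

```python
from bisect import bisect_right

# Hypothetical configurations: each list holds the upper bounds of the
# half-open age intervals, in ascending order.
DEFAULT_AGE_BOUNDS = [18, 30, 40]    # [0,18) -> 1, [18,30) -> 2, [30,40) -> 3
MODIFIED_AGE_BOUNDS = [25, 45, 65]   # [0,25) -> 1, [25,45) -> 2, [45,65) -> 3


def interval_code(age, upper_bounds):
    """Map an age to the 1-based index of the half-open interval containing it."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts the bounds that are <= age, so an age equal to a
    # bound falls into the next interval, matching the [low, high) convention.
    return bisect_right(upper_bounds, age) + 1


print(interval_code(20, DEFAULT_AGE_BOUNDS))   # 2
print(interval_code(20, MODIFIED_AGE_BOUNDS))  # 1
```

Only the list of bounds differs between the two calls, which is the sense in which the encoding rule is a configuration parameter rather than code.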
S120: Determine the feature vector of the business data according to the at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule of the business data.
In this embodiment, once the feature engineering rules have been configured, the system can process the features of the business data automatically according to the configured rules. The automated feature processing generally comprises the following four steps.
The first step is data preprocessing. Following the same configuration flow as for the rules above, corresponding preprocessing rules can be configured for data preprocessing, and a data preprocessing engine filters the rows and columns of the business data and cleans the data. Here a row can represent a category of feature, with different rows denoting different data attributes, and the corresponding column can represent the feature, with different data attributes differing according to individual characteristics; the roles of rows and columns may also be reversed. By configuring row and column filter conditions, all feature data is read and screened to remove abnormal data and junk data.
For example, consider building a risk-control model from a bank's 2017 user data. The user business data includes gender, age, education, occupation, income, consumption, real estate, debt and so on. Developers can configure a reasonable range for each field: age must lie in [0, 120], gender must be male or female, and income must not be negative. During data cleaning, abnormal data (for example, an age of 200 or an income of -50,000) and junk data (for example, duplicate records or records with an empty gender field) can then be removed.
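The cleaning rules in this example can be sketched as a validity check plus duplicate removal. This is an illustrative sketch only: the record layout (dictionaries with `age`, `gender` and `income` keys) and the function names `is_valid` and `clean` are assumptions, not part of the described system.

```python
def is_valid(record):
    """Apply the range rules from the example; the key names are assumptions."""
    age = record.get("age")
    income = record.get("income")
    return (
        age is not None and 0 <= age <= 120
        and record.get("gender") in ("male", "female")
        and income is not None and income >= 0
    )


def clean(records):
    """Drop abnormal rows and exact duplicates, keeping the first occurrence."""
    seen = set()
    cleaned = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the row
        if key in seen or not is_valid(record):
            continue
        seen.add(key)
        cleaned.append(record)
    return cleaned


users = [
    {"age": 35, "gender": "female", "income": 80000},
    {"age": 200, "gender": "male", "income": 50000},    # abnormal age
    {"age": 41, "gender": "male", "income": -50000},    # negative income
    {"age": 35, "gender": "female", "income": 80000},   # duplicate record
    {"age": 28, "gender": None, "income": 60000},       # empty gender field
]
print(clean(users))  # only the first record survives
```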
The second step is feature encoding. According to the configured target encoding rule, a feature encoding engine converts the attribute features of the raw business data into numerical identifiers that a computer can recognize. For example, the gender feature uses One-Hot encoding: for the two genders, male is encoded as 10 and female as 01. The education feature uses numerical mapping: the five education levels [below high school, junior college, bachelor's, master's, doctorate] are encoded as the scalar values [1, 2, 3, 4, 5] respectively. The age feature uses interval mapping: the value 1 denotes the age interval [0, 18), 2 denotes [18, 30) and 3 denotes [30, 40), so an age of 25 is encoded as 2.
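The One-Hot and numerical-mapping encodings in this step can be sketched as follows. The category spellings, the dictionary `EDUCATION_CODES` and the helper names are illustrative assumptions; only the resulting codes (male as 1, 0; female as 0, 1; education levels as 1 to 5) come from the text.

```python
GENDERS = ["male", "female"]

# Assumed English labels for the five education levels, mapped to 1..5.
EDUCATION_CODES = {
    "below high school": 1,
    "junior college": 2,
    "bachelor": 3,
    "master": 4,
    "doctorate": 5,
}


def one_hot(value, categories):
    """Return a list with a single 1 at the position of value in categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


def encode_user(user):
    """Concatenate the one-hot gender code and the numeric education code."""
    return one_hot(user["gender"], GENDERS) + [EDUCATION_CODES[user["education"]]]


print(one_hot("male", GENDERS))                                   # [1, 0]
print(encode_user({"gender": "female", "education": "master"}))   # [0, 1, 4]
```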
The third step is feature normalization. According to the configured target normalization rule, a feature normalization engine scales the feature data proportionally and converts it into dimensionless pure values. For example, the income and deposit features are normalized with the Altman Z-score model.
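The text does not give the normalization formulas. As a hedged sketch, the code below assumes that Z-score normalization means ordinary standardization (subtract the mean, divide by the population standard deviation) and that Min-Max normalization maps values linearly onto [0, 1]; the function names and the sample incomes are hypothetical.

```python
from statistics import mean, pstdev


def z_score(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


incomes = [30000, 50000, 70000]
print(min_max(incomes))  # [0.0, 0.5, 1.0]
print(z_score(incomes))
```

Either function removes the unit of the income column, which is what allows it to be compared or weighted against features of a different magnitude.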
The fourth step is feature dimensionality reduction. According to the configured target dimensionality reduction rule, a feature dimensionality reduction and selection engine recombines or deletes original features to reduce the feature dimension. For example, using PCA and factor analysis, it can be configured that 10 features are combined and represented by 4 features, converting the business data into 4-dimensional features. Multi-feature combination can be configured to merge occupation and income into a single feature, such as "high-income occupation". Features that have little influence on the model can also be deleted according to feature importance.
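Two of the options named in this step, multi-feature combination and deletion by feature importance, can be sketched without any numerical library; PCA and factor analysis are not shown. Everything specific below is a hypothetical illustration: the occupations treated as high income, the income threshold and the importance scores are invented for the example.

```python
def add_high_income_occupation(row, high_income_jobs, income_threshold):
    """Combine occupation and income into one binary feature (hypothetical rule)."""
    combined = dict(row)
    combined["high_income_occupation"] = int(
        row["occupation"] in high_income_jobs and row["income"] >= income_threshold
    )
    return combined


def drop_low_importance(row, importance, min_importance):
    """Keep only the features whose importance score reaches min_importance."""
    return {
        name: value
        for name, value in row.items()
        if importance.get(name, 0.0) >= min_importance
    }


row = {"occupation": "doctor", "income": 120000, "consumption": 3000, "debt": 0}
combined = add_high_income_occupation(row, {"doctor"}, 100000)

# Made-up importance scores; features without a score are treated as 0.
importance = {"high_income_occupation": 0.5, "consumption": 0.3, "debt": 0.02}
print(drop_low_importance(combined, importance, 0.1))
# {'consumption': 3000, 'high_income_occupation': 1}
```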
Through the four-step feature processing flow above, this embodiment processes the features of the business data automatically according to the configured feature engineering rules.
In the technical solution of this embodiment, the configuration parameters of the feature engineering are modified according to the characteristics of the business data so as to determine at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule of the business data, and the feature vector of the business data is generated according to the determined rules. Developers only need to modify the configuration parameters of the feature engineering, either from a business perspective or starting from the optimal configuration generated by the system, and associate the business data with the feature engineering; the corresponding feature vector is then generated automatically. This makes feature engineering modular, automated and reusable, improves the efficiency and accuracy of feature vector generation, and greatly reduces the labor and time cost of feature engineering.
Embodiment 2
On the basis of Embodiment 1, this embodiment provides a preferred implementation of the feature extraction method for business data, in which the automatically generated feature vector is used to build a machine learning model. Fig. 2 is a flowchart of a feature extraction method for business data according to Embodiment 2 of the present invention. As shown in Fig. 2, the method comprises the following steps:
S210: According to the business domain and/or business scenario to which the business data belongs, provide at least one of candidate encoding rules, candidate feature normalization rules and candidate dimensionality reduction rules for the business data.
In this embodiment, the business domain and the business scenario are attributes of the business data: the business domain is an attribute of the business itself, and the business scenario is an attribute of the scenario in which the business is applied. The characteristics of the business data are summarized from the business domain and/or business scenario to which it belongs, and on that basis at least one of candidate encoding rules, candidate feature normalization rules and candidate dimensionality reduction rules is provided for the business data. For example, following the feature processing flow, preprocessing rules such as row and column filtering rules and data cleaning rules can be configured for the data preprocessing engine; encoding rules such as One-Hot encoding, numerical mapping, interval mapping, WOE, logistic regression (LOG) and numerical discretization can be configured for the feature encoding engine; normalization rules such as Min-Max scaling, Z-score, data standardization and binarization can be configured for the feature normalization engine; and dimensionality reduction rules such as PCA, factor analysis, multi-feature combination and feature importance can be configured for the feature dimensionality reduction and selection engine. These candidate rules may be packaged in advance, based on the business domain and/or business scenario to which the business data belongs, for developers to choose from.
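One way to picture pre-packaged candidate rules is a lookup table keyed by business domain and scenario, from which a target rule may only be chosen if it is among the candidates. The sketch below is an assumption about how such a table might look: the keys, rule names and the functions `candidate_rules` and `choose_target_rules` are all hypothetical.

```python
# Hypothetical candidate rule sets, packaged per (business domain, scenario).
CANDIDATE_RULES = {
    ("banking", "risk_control"): {
        "encoding": ["one_hot", "numeric_mapping", "interval_mapping", "woe"],
        "normalization": ["min_max", "z_score"],
        "dim_reduction": ["pca", "factor_analysis", "feature_importance"],
    },
}

# Fallback used when no domain-specific package exists (also an assumption).
GENERIC_RULES = {
    "encoding": ["one_hot", "numeric_mapping"],
    "normalization": ["min_max"],
    "dim_reduction": ["pca"],
}


def candidate_rules(domain, scenario):
    """Return the candidate rules packaged for a business domain and scenario."""
    return CANDIDATE_RULES.get((domain, scenario), GENERIC_RULES)


def choose_target_rules(candidates, selection):
    """Check that at least one rule is selected and each is a known candidate."""
    if not selection:
        raise ValueError("select at least one target rule")
    for stage, rule in selection.items():
        if rule not in candidates.get(stage, []):
            raise ValueError(f"{rule!r} is not a candidate for stage {stage!r}")
    return dict(selection)


candidates = candidate_rules("banking", "risk_control")
print(choose_target_rules(candidates, {"encoding": "woe", "normalization": "z_score"}))
```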
S220: Determine at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule of the business data.
In this embodiment, the feature engineering configuration parameters may include the business data field definitions, the business scenario, the engine rule parameters and the model score. The business data field definitions and the business scenario provide the basis for selecting configuration rules when the business data is processed: from the candidate rules associated with the field definitions and the business scenario, corresponding rule parameters are configured for each feature processing engine. Model quality data fed back by historical machine learning models, that is, the model score, can also be received so that the configuration parameters giving the best modeling result can be selected. The system sets a default configuration for the feature engineering based on historical modeling experience. When the feature engineering rules are configured for given business data, at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule of the business data must be determined from the candidate rules so that the feature engineering can process the data according to the configured rules.
S230: Determine the feature vector of the business data according to the at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule of the business data.
In this embodiment, once the feature engineering rules have been configured, the system can process the features of the business data automatically according to the configured rules. The automated feature processing generally comprises four steps: data preprocessing, feature encoding, feature normalization and feature dimensionality reduction. The business data is processed according to the rules configured by the developer to generate a feature vector that adequately represents the business data for use in later modeling.
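Because only "at least one" of the rules has to be configured, the processing flow can be sketched as a pipeline that applies the configured stages in order and skips the rest. This is a minimal sketch under assumptions: the stage names, the function `build_feature_vectors` and the two example stage functions are hypothetical, and data preprocessing is omitted.

```python
STAGES = ("encoding", "normalization", "dim_reduction")


def build_feature_vectors(rows, rules):
    """Apply the configured stages in order; unconfigured stages are skipped."""
    if not any(stage in rules for stage in STAGES):
        raise ValueError("configure at least one rule")
    data = rows
    for stage in STAGES:
        step = rules.get(stage)
        if step is not None:
            data = step(data)
    return data


def encode_rows(rows):
    """Hypothetical encoding rule: binary gender code plus raw income."""
    return [[1 if r["gender"] == "male" else 0, float(r["income"])] for r in rows]


def min_max_columns(matrix):
    """Hypothetical normalization rule: scale every column onto [0, 1]."""
    scaled_columns = []
    for column in zip(*matrix):
        lo, hi = min(column), max(column)
        span = hi - lo
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


rows = [{"gender": "male", "income": 30000}, {"gender": "female", "income": 70000}]
rules = {"encoding": encode_rows, "normalization": min_max_columns}
print(build_feature_vectors(rows, rules))  # [[1.0, 0.0], [0.0, 1.0]]
```

Swapping a rule means replacing one entry in `rules`; the pipeline function itself does not change.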
S240: Build a machine learning model using the feature vector of the business data.
In this embodiment, modular, automated and reusable feature engineering allows corresponding feature processing rules to be configured for the business data. The feature vector of the business data is generated efficiently and with high accuracy according to the determined at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule. Finally, the machine learning model is built from the automatically generated feature vector.
Preferably, a target sample balancing rule for the business data is determined from candidate sample balancing rules provided in advance; the business data is screened using the target sample balancing rule; and the machine learning model is built using the feature vectors corresponding to the screened business data.
In this embodiment, in addition to the four feature processing steps above, imbalanced samples also need to be handled before the model is built. Following the same configuration flow as for the feature processing rules, corresponding rules can be configured for sample adjustment, and a sample adjustment engine screens the imbalanced business data. Specifically, sample imbalance means that the number of samples in one or more feature categories is much larger than the number in other categories; that is, the sample counts of the feature categories in the sample set differ so greatly that the requirements for model building cannot be met. The imbalance therefore needs to be handled before the model is built in order to obtain a better machine learning model. By way of example, Fig. 3 shows the rules that can be configured at each stage of the business data processing flow before model building. The candidate rules for the four stages of data preprocessing, feature encoding, feature normalization and feature dimensionality reduction are as described above and are not repeated here. For sample imbalance, random sampling, or one or a combination of the Synthetic Minority Oversampling Technique (SMOTE) and Edited Nearest Neighbor (ENN), can be configured to balance the samples. Finally, the machine learning model is built using the feature vectors corresponding to the screened business data, that is, the balanced sample data.
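Of the balancing options listed, only random sampling is simple enough to sketch with the standard library; SMOTE and ENN are not shown here. The sketch assumes the random sampling is undersampling of the larger categories down to the size of the smallest one, which the text does not specify, and the function name and sample layout are hypothetical.

```python
import random
from collections import defaultdict


def random_undersample(samples, label_of, seed=0):
    """Randomly keep the same number of samples from every category.

    The per-category count equals the size of the smallest category, so the
    result is balanced. A fixed seed keeps the sketch reproducible.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[label_of(sample)].append(sample)
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


# Eight samples of category 0 and two of category 1: clearly imbalanced.
samples = [([i], 0) for i in range(8)] + [([i], 1) for i in range(2)]
balanced = random_undersample(samples, label_of=lambda s: s[1])
print(len(balanced))  # 4
```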
S250: If the quality of the machine learning model is higher than that of the historical machine learning model for the business data, update at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule associated with the machine learning model to be the default configuration rule for the business data.
In this embodiment, after the five processing steps above, the resulting business data is used to train and optimize the model; the modeling result is fed back to the feature automation platform, and the default configuration rules of each engine are updated accordingly. After the model is built, if the model trained on feature vectors generated with the new configuration rules performs better, that is, the quality of the newly built machine learning model is higher than that of the historical machine learning model for the business data, then the new configuration rules (at least one of the target encoding rule, target normalization rule and target dimensionality reduction rule associated with the new model) become the default configuration rules for the business data. The next time new data arrives for this business, feature engineering is performed directly with the default configuration rules, which improves the efficiency and accuracy of feature vector generation.
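The update in S250 amounts to a comparison against the best historical score followed by an overwrite of the stored defaults. The sketch below assumes model quality is a single number where larger is better (for instance an AUC value); the function name, the dictionaries and the business key are hypothetical.

```python
def update_default_rules(defaults, history_scores, business_key, new_rules, new_score):
    """Promote new_rules to the default if the new model beats the history.

    defaults maps a business key to its default rule configuration and
    history_scores maps the same key to the best model score seen so far.
    Returns True when the defaults were updated.
    """
    best = history_scores.get(business_key)
    if best is None or new_score > best:
        defaults[business_key] = dict(new_rules)
        history_scores[business_key] = new_score
        return True
    return False


defaults = {"bank_risk": {"encoding": "one_hot", "normalization": "z_score"}}
history_scores = {"bank_risk": 0.81}

# A worse model leaves the defaults untouched.
print(update_default_rules(defaults, history_scores, "bank_risk",
                           {"encoding": "woe", "normalization": "min_max"}, 0.79))  # False

# A better model replaces the defaults.
print(update_default_rules(defaults, history_scores, "bank_risk",
                           {"encoding": "woe", "normalization": "min_max"}, 0.86))  # True
print(defaults["bank_risk"])  # {'encoding': 'woe', 'normalization': 'min_max'}
```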
In summary, Fig. 4 shows the flow of the model training procedure based on the feature engineering automation platform. Developers only need to bind the business data, together with its business domain and/or business scenario, to the feature engineering. On the basis of the candidate configuration rules, the configuration parameters are either set manually or set automatically from the default configuration rules, so that the feature engineering automation platform can process the feature data automatically according to the business data and the configured data processing rules. The feature vectors needed for model training are thereby obtained efficiently and with high accuracy. Finally, the model is trained and evaluated on the feature vectors, the modeling result is fed back to the feature automation platform, and the configuration parameters of the best model become the default configuration rules of each engine.
In the technical solution of this embodiment, multiple candidate rules for each feature processing stage are provided for the business data in advance according to the business domain and business scenario to which it belongs. When feature vectors are generated automatically for the business data, rules are selected from the candidates and configured, so that the configuration parameters of the feature engineering are modified according to the characteristics of the business data and the feature vector is generated according to the determined rules. The sample data is balanced, and the machine learning model is built using the feature vectors corresponding to the balanced samples. Finally, the modeling result is fed back to the feature engineering automation platform, and the configuration rules of the highest-quality model are set as the default configuration rules for that kind of business data.
The embodiments of the present invention lower the machine learning expertise required of modeling personnel, so that business staff can carry out data mining and modeling themselves. They only need to modify the configuration parameters of the feature engineering, either from a business perspective or starting from the optimal configuration generated by the system, and associate the business data with the feature engineering; the corresponding feature vector is then generated automatically. The configuration parameters of the feature vectors that give better modeling results are set as the default configuration parameters of the feature engineering. This makes feature engineering modular, automated and reusable, improves the efficiency and accuracy of feature vector generation, greatly reduces the labor and time cost of feature engineering, and improves the quality and efficiency of model building.
Embodiment three
Fig. 5 is a schematic structural diagram of a feature extraction device for business data provided by Embodiment three of the present invention. This embodiment is applicable where feature engineering is performed on business data to generate feature vectors, and the device can implement the feature extraction method for business data described in any embodiment of the present invention. The device specifically includes:
A rule configuration module 510, configured to determine at least one of a target encoding rule, a target normalization rule, and a target dimensionality reduction rule for the business data; wherein the target encoding rule is determined from candidate encoding rules provided in advance, the target normalization rule is determined from candidate feature normalization rules provided in advance, and the target dimensionality reduction rule is determined from candidate dimensionality reduction rules provided in advance;
A feature generation module 520, configured to determine the feature vector of the business data according to at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data.
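The interplay of the rule configuration module and the feature generation module can be illustrated with a minimal Python sketch. It is not the patented implementation: the rule names (`one_hot`, `ordinal`, `min_max`, `z_score`, `drop_constant`), the registries, and the function `extract_features` are all hypothetical, and dropping constant columns is only a simple stand-in for a real dimensionality reduction rule.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Rows = List[List[float]]


def one_hot(values: Sequence[str]) -> Rows:
    """Candidate encoding rule: one indicator column per distinct category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]


def ordinal(values: Sequence[str]) -> Rows:
    """Candidate encoding rule: map each category to its sorted position."""
    index = {c: float(i) for i, c in enumerate(sorted(set(values)))}
    return [[index[v]] for v in values]


def min_max(rows: Rows) -> Rows:
    """Candidate normalization rule: scale each column to [0, 1]."""
    if not rows:
        return []
    n = len(rows[0])
    lo = [min(r[j] for r in rows) for j in range(n)]
    hi = [max(r[j] for r in rows) for j in range(n)]
    return [[(r[j] - lo[j]) / (hi[j] - lo[j]) if hi[j] > lo[j] else 0.0
             for j in range(n)] for r in rows]


def z_score(rows: Rows) -> Rows:
    """Candidate normalization rule: zero mean, unit (population) variance."""
    if not rows:
        return []
    n, m = len(rows[0]), len(rows)
    mean = [sum(r[j] for r in rows) / m for j in range(n)]
    std = [math.sqrt(sum((r[j] - mean[j]) ** 2 for r in rows) / m)
           for j in range(n)]
    return [[(r[j] - mean[j]) / std[j] if std[j] else 0.0
             for j in range(n)] for r in rows]


def drop_constant(rows: Rows) -> Rows:
    """Stand-in reduction rule (assumption): remove zero-variance columns."""
    if not rows:
        return []
    keep = [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]
    return [[r[j] for j in keep] for r in rows]


# Hypothetical registries of candidate rules provided in advance.
ENCODING_RULES: Dict[str, Callable[[Sequence[str]], Rows]] = {
    "one_hot": one_hot, "ordinal": ordinal}
NORMALIZATION_RULES: Dict[str, Callable[[Rows], Rows]] = {
    "min_max": min_max, "z_score": z_score}
REDUCTION_RULES: Dict[str, Callable[[Rows], Rows]] = {
    "drop_constant": drop_constant}


def extract_features(values: Sequence[str],
                     encoding: Optional[str] = None,
                     normalization: Optional[str] = None,
                     reduction: Optional[str] = None) -> Rows:
    """Apply the configured target rules; at least one must be given."""
    if encoding is None and normalization is None and reduction is None:
        raise ValueError("configure at least one target rule")
    # Without an encoding rule, the raw values are assumed to be numeric.
    rows = (ENCODING_RULES[encoding](values) if encoding
            else [[float(v)] for v in values])
    if normalization:
        rows = NORMALIZATION_RULES[normalization](rows)
    if reduction:
        rows = REDUCTION_RULES[reduction](rows)
    return rows
```

Selecting a rule is then just a matter of passing its name, for example `extract_features(["a", "b", "a"], encoding="one_hot")`; swapping the name changes the processing step without touching the pipeline code.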
Further, the device also includes:
A rule supply module 530, configured to, before at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data is determined, provide at least one of candidate encoding rules, candidate feature normalization rules, and candidate dimensionality reduction rules for the business data according to the business field and/or business scenario to which the business data belongs.
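One way to picture the rule supply module is as a lookup from business field and scenario to the candidate rules offered for configuration. The sketch below is an assumption-laden illustration: the table contents, the field and scenario names, the fallback order, and the function `supply_candidates` are all hypothetical and are not specified by the source.

```python
from typing import Dict, List, Optional, Tuple

CandidateRules = Dict[str, List[str]]

# Hypothetical table: (business field, business scenario) -> candidate rules.
# A scenario of None holds the field-level candidates.
CANDIDATES_BY_CONTEXT: Dict[Tuple[str, Optional[str]], CandidateRules] = {
    ("finance", "fraud_detection"): {
        "encoding": ["one_hot", "ordinal"],
        "normalization": ["z_score"],
        "reduction": ["pca"],
    },
    ("finance", None): {
        "encoding": ["one_hot"],
        "normalization": ["min_max", "z_score"],
        "reduction": [],
    },
}

# Hypothetical candidates offered when nothing more specific is known.
FALLBACK: CandidateRules = {
    "encoding": ["one_hot"],
    "normalization": ["min_max"],
    "reduction": [],
}


def supply_candidates(field: str, scenario: Optional[str] = None) -> CandidateRules:
    """Return candidates for the most specific matching context.

    Lookup order (an assumption): field + scenario, then field only,
    then the global fallback.
    """
    for key in ((field, scenario), (field, None)):
        if key in CANDIDATES_BY_CONTEXT:
            return CANDIDATES_BY_CONTEXT[key]
    return FALLBACK
```

The rule configuration step would then choose its target rules only from the lists returned here, for example `supply_candidates("finance", "fraud_detection")["normalization"]`.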
Further, the device also includes:
A model construction module 540, configured to construct a machine learning model using the feature vector of the business data after the feature vector of the business data has been determined according to at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data;
A default rule update module 550, configured to, if the quality of the machine learning model is higher than that of the historical machine learning model for the business data, update at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule associated with the machine learning model to be the default configuration rule for the business data.
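The update performed by the default rule update module amounts to a compare-and-replace on a stored best result. The following sketch assumes that model quality is summarized as a single number where larger is better (for example a validation AUC) and that the first reported model always becomes the default; neither assumption comes from the source, and the class `DefaultRuleStore` and its method `report` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DefaultRuleStore:
    """Keeps, per kind of business data, the best quality seen so far
    and the rule configuration that produced it."""
    best_quality: Dict[str, float] = field(default_factory=dict)
    default_config: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def report(self, data_kind: str, config: Dict[str, str], quality: float) -> bool:
        """Record a newly built model; return True if the default was updated."""
        previous = self.best_quality.get(data_kind)
        # Update only when the new model beats the historical model
        # (or when there is no historical model yet: an assumption).
        if previous is None or quality > previous:
            self.best_quality[data_kind] = quality
            self.default_config[data_kind] = dict(config)
            return True
        return False
```

A later run on the same kind of business data can start from `store.default_config[data_kind]` instead of an empty configuration, which is how the feedback loop makes good configurations reusable.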
Preferably, the model construction module 540 includes:
A balancing rule determination unit, configured to determine a target sample balancing rule for the business data from candidate sample balancing rules provided in advance;
A data screening unit, configured to screen the business data using the target sample balancing rule;
A model construction unit, configured to construct a machine learning model using the feature vectors corresponding to the screened business data.
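The three units above can be sketched as choosing a balancing rule by name and using it to select which samples reach model training. The source does not list the candidate balancing rules in this section, so random undersampling of the larger classes is used here purely as an assumed example; `undersample`, `keep_all`, `BALANCING_RULES`, and `screen` are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, Tuple


def undersample(labels: Sequence[Hashable], seed: int = 0) -> List[int]:
    """Assumed balancing rule: keep as many samples of every class as the
    smallest class has, chosen at random. Returns sorted sample indices."""
    if not labels:
        return []
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)  # seeded so the screening is reproducible
    kept: List[int] = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n))
    return sorted(kept)


def keep_all(labels: Sequence[Hashable], seed: int = 0) -> List[int]:
    """Trivial rule: no balancing, every sample is kept."""
    return list(range(len(labels)))


# Hypothetical registry of candidate sample balancing rules.
BALANCING_RULES: Dict[str, Callable[..., List[int]]] = {
    "undersample": undersample,
    "none": keep_all,
}


def screen(features: Sequence[List[float]], labels: Sequence[Hashable],
           rule: str = "undersample", seed: int = 0
           ) -> Tuple[List[List[float]], List[Hashable]]:
    """Apply the target balancing rule and return the screened samples."""
    kept = BALANCING_RULES[rule](labels, seed)
    return [features[i] for i in kept], [labels[i] for i in kept]
```

The screened feature vectors and labels returned by `screen` are what a model construction unit would pass to the training routine of whichever learning library is in use.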
Through the cooperation of the functional modules, the technical solution of this embodiment implements functions such as configuring feature engineering parameters, associating business data with the feature engineering, generating feature vectors automatically, building machine learning models, feeding back model performance, and updating the default feature engineering configuration. With the embodiment of the present invention, research and development personnel only need to modify the configuration parameters of the feature engineering, either from a business perspective or starting from the optimal configuration generated by the system, and associate the business data with the feature engineering; the feature vectors corresponding to the business data are then generated automatically. The configuration parameters corresponding to the feature vectors with better modeling results are set as the default configuration parameters of the feature engineering. This makes feature engineering modular, automated, and reusable, improves the efficiency and accuracy of feature vector generation, substantially reduces the labor and time cost of feature engineering, and improves the effectiveness and efficiency of model construction.
Embodiment four
Fig. 6 is a schematic structural diagram of a server provided by Embodiment four of the present invention. Fig. 6 shows a block diagram of an exemplary server suitable for implementing embodiments of the present invention. The server 12 shown in Fig. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 6, the server 12 takes the form of a general-purpose computing device. The components of the server 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The server 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the server 12, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The server 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 6, commonly referred to as a "hard disk drive"). Although not shown in Fig. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The server 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the server 12, and/or with any device (such as a network card, a modem, etc.) that enables the server 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. The server 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the server 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the server 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
By running the programs stored in the system memory 28, the processing unit 16 executes various functional applications and data processing, for example implementing the feature extraction method for business data provided by the embodiments of the present invention.
Embodiment five
Embodiment five of the present invention further provides a computer-readable storage medium on which a computer program (also referred to as computer-executable instructions) is stored. When executed by a processor, the program performs a feature extraction method for business data, the method including:
Determining at least one of a target encoding rule, a target normalization rule, and a target dimensionality reduction rule for the business data; and
Determining the feature vector of the business data according to at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data.
The computer storage medium of the embodiments of the present invention may use any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for carrying out the operations of the embodiments of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only the preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the embodiments of the present invention are not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the protection scope of the embodiments of the present invention. Therefore, although the embodiments of the present invention have been described in further detail through the above embodiments, they are not limited to the above embodiments and may include other equivalent embodiments without departing from the concept of the embodiments of the present invention; the scope of the embodiments of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A feature extraction method for business data, characterized by comprising:
Determining at least one of a target encoding rule, a target normalization rule, and a target dimensionality reduction rule for the business data; and
Determining the feature vector of the business data according to at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data.
2. The method according to claim 1, characterized in that, before determining at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule for the business data, the method further comprises:
Providing at least one of candidate encoding rules, candidate feature normalization rules, and candidate dimensionality reduction rules for the business data according to the business field and/or business scenario to which the business data belongs.
3. The method according to claim 1, characterized in that, after determining the feature vector of the business data according to at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data, the method further comprises:
Constructing a machine learning model using the feature vector of the business data;
If the quality of the machine learning model is higher than that of the historical machine learning model for the business data, updating at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule associated with the machine learning model to be the default configuration rule for the business data.
4. The method according to claim 3, characterized in that constructing a machine learning model using the feature vector of the business data comprises:
Determining a target sample balancing rule for the business data from candidate sample balancing rules provided in advance;
Screening the business data using the target sample balancing rule;
Constructing a machine learning model using the feature vectors corresponding to the screened business data.
5. A feature extraction device for business data, characterized by comprising:
A rule configuration module, configured to determine at least one of a target encoding rule, a target normalization rule, and a target dimensionality reduction rule for the business data;
A feature generation module, configured to determine the feature vector of the business data according to at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data.
6. The device according to claim 5, characterized in that the device further comprises:
A rule supply module, configured to, before at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data is determined, provide at least one of candidate encoding rules, candidate feature normalization rules, and candidate dimensionality reduction rules for the business data according to the business field and/or business scenario to which the business data belongs.
7. The device according to claim 5, characterized in that the device further comprises:
A model construction module, configured to construct a machine learning model using the feature vector of the business data after the feature vector of the business data has been determined according to at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule of the business data;
A default rule update module, configured to, if the quality of the machine learning model is higher than that of the historical machine learning model for the business data, update at least one of the target encoding rule, the target normalization rule, and the target dimensionality reduction rule associated with the machine learning model to be the default configuration rule for the business data.
8. The device according to claim 7, characterized in that the model construction module comprises:
A balancing rule determination unit, configured to determine a target sample balancing rule for the business data from candidate sample balancing rules provided in advance;
A data screening unit, configured to screen the business data using the target sample balancing rule;
A model construction unit, configured to construct a machine learning model using the feature vectors corresponding to the screened business data.
9. A server, characterized by comprising:
One or more processors;
A memory, configured to store one or more programs;
When the one or more programs are executed by the one or more processors, the one or more processors implement the feature extraction method for business data according to any one of claims 1 to 4.
10. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the feature extraction method for business data according to any one of claims 1 to 4.
CN201810289688.8A 2018-03-30 2018-03-30 Feature extracting method, device, server and the storage medium of business datum Pending CN110334720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810289688.8A CN110334720A (en) 2018-03-30 2018-03-30 Feature extracting method, device, server and the storage medium of business datum

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810289688.8A CN110334720A (en) 2018-03-30 2018-03-30 Feature extracting method, device, server and the storage medium of business datum

Publications (1)

Publication Number Publication Date
CN110334720A true CN110334720A (en) 2019-10-15

Family

ID=68139901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810289688.8A Pending CN110334720A (en) 2018-03-30 2018-03-30 Feature extracting method, device, server and the storage medium of business datum

Country Status (1)

Country Link
CN (1) CN110334720A (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103176983A (en) * 2011-12-20 2013-06-26 中国科学院计算机网络信息中心 Event warning method based on Internet information
CN103743486A (en) * 2014-01-02 2014-04-23 上海大学 Automatic grading system and method based on mass tobacco leaf data
CN103854063A (en) * 2012-11-29 2014-06-11 中国科学院计算机网络信息中心 Internet open information-based event occurrence risk prediction and early-warning method
CN104156562A (en) * 2014-07-15 2014-11-19 清华大学 Failure predication system and failure predication method for background operation and maintenance system of bank
CN104239856A (en) * 2014-09-04 2014-12-24 电子科技大学 Face recognition method based on Gabor characteristics and self-adaptive linear regression
CN104268595A (en) * 2014-09-24 2015-01-07 深圳市华尊科技有限公司 General object detecting method and system
CN104468711A (en) * 2014-10-31 2015-03-25 上海融军科技有限公司 Universal data management coding method and system for internet of things
CN104933075A (en) * 2014-03-20 2015-09-23 百度在线网络技术(北京)有限公司 User attribute predicting platform and method
CN105302911A (en) * 2015-11-10 2016-02-03 珠海多玩信息技术有限公司 Data screening engine establishing method and data screening engine
CN105426356A (en) * 2015-10-29 2016-03-23 杭州九言科技股份有限公司 Target information identification method and apparatus
CN106682067A (en) * 2016-11-08 2017-05-17 浙江邦盛科技有限公司 Machine learning anti-fraud monitoring system based on transaction data
CN106779087A (en) * 2016-11-30 2017-05-31 福建亿榕信息技术有限公司 A kind of general-purpose machinery learning data analysis platform
CN107025141A (en) * 2017-05-18 2017-08-08 成都海天数联科技有限公司 A kind of dispatching method based on big data mixture operation model
CN107423442A (en) * 2017-08-07 2017-12-01 火烈鸟网络(广州)股份有限公司 Method and system, storage medium and computer equipment are recommended in application based on user's portrait behavioural analysis
CN107463703A (en) * 2017-08-16 2017-12-12 电子科技大学 English social media account number classification method based on information gain

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2785764C1 (en) * 2019-10-31 2022-12-13 Биго Текнолоджи Пте. Лтд. Information recommendation method, device, recommendation server and storage device
CN113010510A (en) * 2019-12-20 2021-06-22 中国移动通信集团安徽有限公司 Service identification method, device and system and computing equipment
CN113010510B (en) * 2019-12-20 2024-03-19 中国移动通信集团安徽有限公司 Service identification method, device, system and computing equipment
CN111522797A (en) * 2020-04-27 2020-08-11 支付宝(杭州)信息技术有限公司 Method and device for building business model based on business database
CN111522797B (en) * 2020-04-27 2023-06-02 支付宝(杭州)信息技术有限公司 Method and device for constructing business model based on business database
CN111581305A (en) * 2020-05-18 2020-08-25 北京字节跳动网络技术有限公司 Feature processing method, feature processing device, electronic device, and medium
CN111581305B (en) * 2020-05-18 2023-08-08 抖音视界有限公司 Feature processing method, device, electronic equipment and medium
CN113158022A (en) * 2021-01-29 2021-07-23 北京达佳互联信息技术有限公司 Service recommendation method, device, server and storage medium
CN113158022B (en) * 2021-01-29 2024-03-12 北京达佳互联信息技术有限公司 Service recommendation method, device, server and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191015
