CN116383659A - Parameter optimization method and device for machine learning feature engineering - Google Patents
- Publication number
- CN116383659A (application CN202310384601.6A)
- Authority
- CN
- China
- Prior art keywords
- training
- model
- training samples
- dimension
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/217—Validation; Performance evaluation; Active pattern learning techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The present disclosure provides a parameter optimization method and apparatus for machine learning feature engineering, the method comprising: acquiring a current sample space and quantifying the importance of the dimension features of the first training samples in the current sample space; sorting the dimensions of the first training samples in descending order according to the quantization result; during training of a neural network model, for the i-th training run, selecting the top-i ranked dimension features from the first training samples to form second training samples corresponding to the first training samples, and training the neural network model with the second training samples to generate a target model, where i is a natural number, i is less than or equal to n, and n is the feature dimension of the first training samples; and verifying the target models and selecting a model meeting a preset condition as the final model. In this way, features can be evaluated automatically, improving both working efficiency and the accuracy of the generated model.
Description
Technical Field
Embodiments of the present disclosure relate generally to the field of machine learning technology, and more particularly, to a parameter optimization method and apparatus for machine learning feature engineering.
Background
Feature engineering screens data features out of raw data in order to improve the training effect of a model. Generally, the first step of a machine learning workflow is to define a feature set for the samples, and then to select an appropriate sample set for training based on the defined features. This usually involves a relatively time-consuming tuning process: researchers must select and recombine the candidate features of the data in different ways to arrive at a training model that meets the requirements. From a mathematical standpoint, it is easy to see that n candidate features give rise to on the order of n! possible combinations. Because the adjustment and combination are performed manually, human experience is usually needed to screen these n! possible combinations and reduce the test space. Finding a suitable model is therefore difficult: it not only requires substantial experience but also consumes considerable time.
Disclosure of Invention
According to the embodiments of the present disclosure, a parameter optimization scheme for machine learning feature engineering is provided for automatically evaluating features, thereby improving working efficiency and the accuracy of the generated model.
In a first aspect of the present disclosure, there is provided a parameter optimization method for machine learning feature engineering, comprising:
acquiring a current sample space, and quantifying the importance of the dimension characteristics of a first training sample in the current sample space;
sorting the dimensions of the first training samples in descending order according to the quantization result;
during training of a neural network model, for the i-th training run, selecting the top-i ranked dimension features from the first training samples to form second training samples corresponding to the first training samples, and training the neural network model with the second training samples to generate a target model, wherein i is a natural number, i is less than or equal to n, and n is the feature dimension of the first training samples;
and verifying the target model, and selecting a model meeting preset conditions as a final model.
In some embodiments, quantifying the importance of the dimension features of the first training samples in the current sample space includes:
quantifying the importance of the dimension features of the first training samples by a sample deviation value, wherein the deviation index w_i of the i-th dimension feature is calculated by:
where w_i is the deviation index and m is the number of first training samples in the current sample space.
In some embodiments, sorting the dimensions in the first training samples in descending order according to the quantization result includes:
sorting the dimensions in the first training samples in descending order of the sample deviation values, from highest to lowest.
In some embodiments, after the dimensions in the first training samples are sorted in descending order according to the quantization result, the method further includes: dividing the first training samples in the current sample space into a training set and a verification set;
and, during training of the neural network model, selecting the top-i ranked dimension features from the training set to form the second training samples corresponding to the first training samples, and training the neural network model.
In some embodiments, after generating the target model, the method further includes:
verifying the target model using the verification set.
In some embodiments, verifying the target model with the verification set includes:
selecting accuracy, precision, recall, or F1 as the evaluation index according to the requirements of the practical application, and verifying the target model with the verification set.
In some embodiments, verifying the target model and selecting a model meeting a preset condition as the final model includes:
for the multiple generated target models, selecting the target model whose recognition accuracy is greater than a preset threshold as the final target model, based on the results of verifying the target models with the verification set.
In a second aspect of the present disclosure, there is provided a parameter optimization apparatus for machine learning feature engineering, comprising:
a sample space acquisition module, configured to acquire a current sample space and quantify the importance of the dimension features of the first training samples in the current sample space;
a dimension sorting module, configured to sort the dimensions of the first training samples in descending order according to the quantization result;
a model training module, configured to, for the i-th training run during training of the neural network model, select the top-i ranked dimension features from the first training samples to form second training samples corresponding to the first training samples, and train the neural network model with the second training samples to generate a target model, wherein i is a natural number, i is less than or equal to n, and n is the feature dimension of the first training samples;
and a model verification module, configured to verify the target models and select a model meeting a preset condition as the final model.
In a third aspect of the present disclosure, there is provided an electronic device comprising a memory having a computer program stored thereon and a processor that when executing the program implements the method as described above.
In a fourth aspect of the present disclosure, a computer readable storage medium is provided, on which a computer program is stored, which program, when being executed by a processor, implements a method as described above.
Through the parameter optimization method for machine learning feature engineering described above, features can be evaluated automatically, thereby improving working efficiency and the accuracy of the generated model.
The matters described in the summary section are not intended to limit key or critical features of the embodiments of the present disclosure nor to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, wherein like or similar reference numerals denote like or similar elements, in which:
FIG. 1 illustrates a flow chart of a parameter optimization method for machine learning feature engineering in accordance with an embodiment of the present disclosure;
FIG. 2 shows a schematic structural diagram of a parameter optimization device for machine learning feature engineering in accordance with a second embodiment of the present disclosure;
fig. 3 shows a schematic block diagram of an electronic device used to implement an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the disclosure, are within the scope of the disclosure.
In addition, the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
The parameter optimization method for machine learning feature engineering of the present disclosure automates the feature evaluation process: suitable feature parameters can be selected effectively through this process to serve as the feature engineering model of a system, so that the evaluation previously done by hand is carried out by the computer. This greatly improves working efficiency and, at the same time, effectively improves the accuracy of the model.
Specifically, as shown in fig. 1, a flowchart of a parameter optimization method for machine learning feature engineering according to an embodiment of the present disclosure is shown. As an optional embodiment of the disclosure, in this embodiment, the parameter optimization method for machine learning feature engineering may include the following steps:
s101: and acquiring a current sample space, and carrying out importance quantification on the dimension characteristics of a first training sample in the current sample space.
The parameter optimization method for machine learning feature engineering of the embodiments of the present disclosure can be applied to feature engineering, and in particular to the feature evaluation stage of feature engineering. Feature engineering screens data features out of raw data in order to improve the training effect of a model. Generally, the first step of a machine learning workflow is to define a feature set for the samples, and then to select an appropriate sample set for training based on the defined features. This usually involves a relatively time-consuming tuning process: researchers must select and recombine the candidate features of the data in different ways to arrive at a training model that meets the requirements. From a mathematical standpoint, it is easy to see that n candidate features give rise to on the order of n! possible combinations. Because the adjustment and combination are performed manually, human experience is usually needed to screen these n! possible combinations and reduce the test space. Finding a suitable model is therefore difficult: it not only requires substantial experience but also consumes considerable time.
To this end, the present disclosure provides a parameter optimization method for machine learning feature engineering for improving work efficiency and model accuracy.
In the process of training a model by utilizing feature engineering, a sample space is often needed, and without losing generality, the technical scheme of the disclosure is described by taking a sample space as an example (namely a current sample space).
First, the current sample space is acquired. It includes a plurality of training samples, where each sample may be, for example, a vector with a plurality of dimensions; samples that are not represented in vector form may first be quantized into vector form.
After the current sample space is acquired, the importance of the training samples (denoted as the first training samples) in the current sample space is quantified using the method of this embodiment. Specifically, the dimension features of the first training samples may be importance-quantified by a sample deviation value, where the deviation index w_i of the i-th dimension feature is calculated by:
where w_i is the deviation index and m is the number of first training samples in the current sample space.
S102: sort the dimensions of the first training samples in descending order according to the quantization result.
In this embodiment, after importance quantization is performed on the dimension features of the first training sample, the dimensions in the first training sample may be further sorted in descending order according to the quantization result.
Specifically, the dimensions in the first training samples may be sorted in descending order of the sample deviation values, from highest to lowest. For example, consider a sample space A = {x_1, x_2, ..., x_m} of n-dimensional vectors x, i.e., |A| = m, meaning there are m samples in the space. Each sample x has n features and can be expressed as x_i = (x^(1), x^(2), ..., x^(n)). The importance of the current dimension is quantified; for example, the deviation index w_i of the i-th dimension feature can be expressed using the following formula:
by calculating the data of each dimension by using the formula, the deviation index of each latitude can be effectively calculated, and thus the deviation index can be used as an evaluation basis of the feature importance.
After sorting by deviation index, the important dimensions are placed at the front.
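As an illustration only, a minimal sketch of steps S101 and S102 might look like the following (Python). The exact deviation-index formula appears as an image in the original publication and is not reproduced in this text, so the per-dimension standard deviation over the m samples is used as a stand-in; the function and variable names are hypothetical.

```python
import numpy as np

def rank_dimensions(first_samples: np.ndarray) -> np.ndarray:
    """Return dimension indices sorted in descending order of importance.

    first_samples: array of shape (m, n) -- the m first training samples,
    each with n dimension features.
    NOTE: the per-dimension standard deviation is only a stand-in for the
    patent's deviation index w_i, whose exact formula is not given here.
    """
    w = first_samples.std(axis=0)      # one deviation value per dimension
    return np.argsort(w)[::-1]         # most important dimensions first
```

With this ordering, the i-th training run described in S103 below can simply take the first i entries of the returned index array.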
S103: during training of the neural network model, for the i-th training run, select the top-i ranked dimension features from the first training samples to form second training samples corresponding to the first training samples, and train the neural network model with the second training samples to generate a target model, where i is a natural number, i is less than or equal to n, and n is the feature dimension of the first training samples.
For samples with n features, n training runs are performed. In the i-th run, the model is trained on the training set using the first i features as ranked by their deviation values. This yields n trained models in total.
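A hedged sketch of these n training runs, continuing the example above: a scikit-learn MLPClassifier is used as a stand-in neural network model, since the patent does not fix a particular architecture, and the helper name train_candidate_models is hypothetical.

```python
from sklearn.neural_network import MLPClassifier

def train_candidate_models(X_train, y_train, order):
    """Train n target models; the i-th model sees only the top-i ranked dimensions."""
    n = X_train.shape[1]
    models = []
    for i in range(1, n + 1):
        top_i = order[:i]                           # dimensions ranked 1..i
        second_samples = X_train[:, top_i]          # "second training samples"
        model = MLPClassifier(max_iter=500).fit(second_samples, y_train)
        models.append((top_i, model))
    return models
```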
S104: verify the target models, and select a model meeting the preset condition as the final model.
After the target models are generated, they can be verified, and a model meeting the preset condition is selected as the final model. For example, accuracy, precision, recall, or F1 may be selected as the evaluation index according to the requirements of the practical application, and the verification set is used to verify the target models.
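One way this verification step could be realized is sketched below; the metric names, the 0.9 threshold, and the tie-breaking rule (keep the highest-scoring model above the threshold) are assumptions for illustration, not values fixed by the patent.

```python
from sklearn.metrics import accuracy_score, f1_score

def select_final_model(models, X_val, y_val, metric="accuracy", threshold=0.9):
    """Verify each target model on the verification set and pick the final model."""
    scorer = accuracy_score if metric == "accuracy" else f1_score  # f1 assumes a binary task
    best = None
    for top_i, model in models:
        score = scorer(y_val, model.predict(X_val[:, top_i]))
        if score > threshold and (best is None or score > best[0]):
            best = (score, top_i, model)
    return best  # None if no model meets the preset condition
```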
The parameter optimization method for machine learning feature engineering of the present disclosure automates the feature evaluation process: suitable feature parameters can be selected effectively through this process to serve as the feature engineering model of a system, so that the evaluation previously done by hand is carried out by the computer. This greatly improves working efficiency and, at the same time, effectively improves the accuracy of the model.
Furthermore, as an optional embodiment of the present disclosure, after the dimensions in the first training samples are sorted in descending order according to the quantization result, the method further includes: dividing the first training samples in the current sample space into a training set and a verification set;
and, during training of the neural network model, selecting the top-i ranked dimension features from the training set to form the second training samples corresponding to the first training samples, and training the neural network model.
After the target model is generated, the method further includes: verifying the target model using the verification set.
Verifying the target model and selecting a model meeting a preset condition as the final model includes the following step:
for the multiple generated target models, selecting the target model whose recognition accuracy is greater than a preset threshold as the final target model, based on the results of verifying the target models with the verification set.
By setting up a training set and a verification set in this way, the accuracy of the model is further improved.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but it should be understood by those skilled in the art that the present disclosure is not limited by the order of actions described, as some steps may take other order or occur simultaneously in light of the present disclosure. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the acts and modules referred to are not necessarily required by the present disclosure.
The foregoing is a description of embodiments of the method, and the following further describes embodiments of the present disclosure through examples of apparatus.
As shown in fig. 2, a parameter optimization apparatus for machine learning feature engineering according to a second embodiment of the present disclosure includes:
a sample space obtaining module 201, configured to obtain a current sample space, and perform importance quantization on dimension features of a first training sample in the current sample space;
a dimension sorting module 202, configured to sort dimensions in the first training sample in descending order according to a quantization result;
the model training module 203 is configured to, in the training process of the neural network model, select, for the i-th training run, the top-i ranked dimension features from the first training samples to form a second training sample corresponding to the first training sample, and train the neural network model with the second training sample to generate a target model, where i is a natural number, i is less than or equal to n, and n is the feature dimension of the first training sample;
the model verification module 204 is configured to verify the target model, and select a model that meets a preset condition as a final model.
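Purely as an illustration of how the four modules could cooperate, the sketch below wraps the helper functions from the method embodiment into one class; the class and method names are hypothetical and not part of the patented apparatus.

```python
from sklearn.model_selection import train_test_split

class FeatureEngineeringOptimizer:
    """Hypothetical wrapper mirroring the four modules of the apparatus."""

    def acquire_sample_space(self, X, y):        # sample space acquisition module (201)
        self.X_train, self.X_val, self.y_train, self.y_val = train_test_split(
            X, y, test_size=0.2)
        return self

    def sort_dimensions(self):                   # dimension sorting module (202)
        self.order = rank_dimensions(self.X_train)
        return self

    def train(self):                             # model training module (203)
        self.models = train_candidate_models(self.X_train, self.y_train, self.order)
        return self

    def verify(self):                            # model verification module (204)
        return select_final_model(self.models, self.X_val, self.y_val)
```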
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the described modules may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.
Fig. 3 shows a schematic block diagram of an electronic device 300 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
The electronic device 300 includes a computing unit 301 that can perform various appropriate actions and processes according to a computer program stored in a ROM 302 or a computer program loaded from a storage unit 308 into a RAM 303. In the RAM 303, various programs and data required for the operation of the electronic device 300 may also be stored. The computing unit 301, the ROM 302, and the RAM 303 are connected to each other by a bus 304. An I/O interface 305 is also connected to the bus 304.
Various components in the electronic device 300 are connected to the I/O interface 305, including: an input unit 306 such as a keyboard, a mouse, etc.; an output unit 307 such as various types of displays, speakers, and the like; a storage unit 308 such as a magnetic disk, an optical disk, or the like; and a communication unit 309 such as a network card, modem, wireless communication transceiver, etc. The communication unit 309 allows the electronic device 300 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 301 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 301 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 301 performs the various methods and processes described above, such as a parameter optimization method for machine learning feature engineering. For example, in some embodiments, the parameter optimization method for machine learning feature engineering may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 300 via the ROM302 and/or the communication unit 309. When the computer program is loaded into RAM303 and executed by computing unit 301, one or more of the steps of the parameter optimization method for machine learning feature engineering described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform the parameter optimization method for machine learning feature engineering in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: display means for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Claims (10)
1. A method for optimizing parameters for machine learning feature engineering, comprising:
acquiring a current sample space, and quantifying the importance of the dimension characteristics of a first training sample in the current sample space;
sorting the dimensions of the first training samples in descending order according to the quantization result;
during training of a neural network model, for the i-th training run, selecting the top-i ranked dimension features from the first training samples to form second training samples corresponding to the first training samples, and training the neural network model with the second training samples to generate a target model, wherein i is a natural number, i is less than or equal to n, and n is the feature dimension of the first training samples;
and verifying the target model, and selecting a model meeting preset conditions as a final model.
2. The method of claim 1, wherein quantifying the importance of the dimension features of the first training samples in the current sample space comprises:
quantifying the importance of the dimension features of the first training samples by a sample deviation value, wherein the deviation index w_i of the i-th dimension feature is calculated by:
wherein w_i is the deviation index and m is the number of first training samples in the current sample space.
3. The method of claim 1, wherein sorting the dimensions in the first training samples in descending order according to the quantization result comprises:
sorting the dimensions in the first training samples in descending order of the sample deviation values, from highest to lowest.
4. The parameter optimization method according to claim 3, further comprising, after the dimensions in the first training samples are sorted in descending order according to the quantization result: dividing the first training samples in the current sample space into a training set and a verification set;
and, during training of the neural network model, selecting the top-i ranked dimension features from the training set to form the second training samples corresponding to the first training samples, and training the neural network model.
5. The parameter optimization method of claim 4, wherein after the generating of the target model, the method further comprises:
verifying the target model using the verification set.
6. The method of claim 5, wherein verifying the target model using the verification set comprises:
selecting accuracy, precision, recall, or F1 as the evaluation index according to the requirements of the practical application, and verifying the target model with the verification set.
7. The parameter optimization method according to claim 6, wherein verifying the target model and selecting a model satisfying a preset condition as the final model comprises:
for the multiple generated target models, selecting the target model whose recognition accuracy is greater than a preset threshold as the final target model, based on the results of verifying the target models with the verification set.
8. Parameter optimization apparatus for machine learning feature engineering, characterized by comprising:
a sample space acquisition module, configured to acquire a current sample space and quantify the importance of the dimension features of the first training samples in the current sample space;
a dimension sorting module, configured to sort the dimensions of the first training samples in descending order according to the quantization result;
a model training module, configured to, for the i-th training run during training of the neural network model, select the top-i ranked dimension features from the first training samples to form second training samples corresponding to the first training samples, and train the neural network model with the second training samples to generate a target model, wherein i is a natural number, i is less than or equal to n, and n is the feature dimension of the first training samples;
and a model verification module, configured to verify the target models and select a model meeting a preset condition as the final model.
9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, characterized in that the processor, when executing the program, implements the method of any of claims 1-7.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310384601.6A CN116383659A (en) | 2023-04-06 | 2023-04-06 | Parameter optimization method and device for machine learning feature engineering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310384601.6A CN116383659A (en) | 2023-04-06 | 2023-04-06 | Parameter optimization method and device for machine learning feature engineering |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116383659A true CN116383659A (en) | 2023-07-04 |
Family
ID=86969131
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310384601.6A Pending CN116383659A (en) | 2023-04-06 | 2023-04-06 | Parameter optimization method and device for machine learning feature engineering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116383659A (en) |
- 2023-04-06 CN CN202310384601.6A patent/CN116383659A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2020143321A1 (en) | Training sample data augmentation method based on variational autoencoder, storage medium and computer device | |
CN109947940B (en) | Text classification method, device, terminal and storage medium | |
CN111368887B (en) | Training method of thunderstorm weather prediction model and thunderstorm weather prediction method | |
CN116596095B (en) | Training method and device of carbon emission prediction model based on machine learning | |
CN111382906A (en) | Power load prediction method, system, equipment and computer readable storage medium | |
CN113360711A (en) | Model training and executing method, device, equipment and medium for video understanding task | |
CN114580649A (en) | Method and device for eliminating quantum Pagli noise, electronic equipment and medium | |
CN114861039B (en) | Parameter configuration method, device, equipment and storage medium of search engine | |
CN115392441A (en) | Method, apparatus, device and medium for on-chip adaptation of quantized neural network model | |
CN116580223A (en) | Data processing and model fine tuning method and device, electronic equipment and storage medium | |
CN113642710B (en) | Quantification method, device, equipment and storage medium of network model | |
CN113094899B (en) | Random power flow calculation method and device, electronic equipment and storage medium | |
CN117971487A (en) | High-performance operator generation method, device, equipment and storage medium | |
CN113052063A (en) | Confidence threshold selection method, device, equipment and storage medium | |
US11507782B2 (en) | Method, device, and program product for determining model compression rate | |
CN115345312A (en) | Electronic design automation method and device | |
CN112329822A (en) | Method, system, equipment and medium for improving classification precision of support vector machine | |
CN115456184B (en) | Quantum circuit processing method, quantum state preparation device, quantum state preparation equipment and quantum state preparation medium | |
CN116383659A (en) | Parameter optimization method and device for machine learning feature engineering | |
CN114444606A (en) | Model training and data classification method and device | |
CN114462595A (en) | Block chain-based model lightweight method, device, equipment and storage medium | |
CN114491416B (en) | Processing method and device of characteristic information, electronic equipment and storage medium | |
CN116782374A (en) | WiFi fingerprint library updating method and device and electronic equipment | |
CN113435058B (en) | Data dimension reduction method, system, terminal and medium for distribution network self-healing test model | |
CN114626546A (en) | Atmospheric pollution source data analysis method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |