CN114816506A - Model feature rapid processing method and device, storage medium and electronic equipment - Google Patents
Model feature rapid processing method and device, storage medium and electronic equipment
- Publication number
- CN114816506A (application number CN202210425982.3A)
- Authority
- CN
- China
- Prior art keywords
- processing
- configuration
- features
- feature
- subfiles
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
- G06F8/71—Version control; Configuration management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/35—Creation or generation of source code model driven
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
- G06F8/61—Installation
- G06F8/63—Image based installation; Cloning; Build to order
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Computer Security & Cryptography (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
Abstract
The embodiment of the invention discloses a model feature rapid processing method and device, a storage medium and electronic equipment. The method comprises the following steps: determining a plurality of functional modules corresponding to different processing types of model features, and generating a configuration subfile to be configured for each functional module; classifying the features in the data set to be processed according to feature type; determining the corresponding configuration subfiles to be configured for the different types of features, acquiring the configuration parameters of each configuration subfile, and generating a complete configuration file; and processing the features of the corresponding type based on the generated complete configuration file.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a model feature rapid processing method and device, a storage medium and electronic equipment.
Background
With the popularization of artificial intelligence, more and more business scenarios need the assistance of algorithm models, and features play a decisive role in such models. Feature processing occupies most of the workload in the whole modeling process, and after a model goes online its feature calculation logic must also be kept consistent with the feature calculation logic used during training.
At present, a developer has to pay attention to the processing logic during modeling, which places high demands on the developer's expertise and involves a large workload. In addition, to ensure that the feature calculation logic of the online prediction model stays consistent with that used during training, the code is developed once for training and developed again for online prediction, which wastes time and labor.
Disclosure of Invention
In order to overcome the above defects in the prior art, the present invention provides a model feature rapid processing method and device, a storage medium and electronic equipment, so that when processing features for model training a developer only needs to pay attention to the processing steps rather than the processing logic, and the model features can be processed simply by filling in the corresponding configuration files.
Another object of the present invention is to provide a model feature rapid processing method and device, a storage medium and electronic equipment, in which a processing image is generated during the model feature processing procedure, so that the feature calculation logic after the model goes online is consistent with the feature calculation logic used during training.
In order to achieve the above object, the present invention provides a method for rapidly processing model features, comprising the following steps:
determining a plurality of function modules corresponding to different processing types of model characteristics, and respectively generating configuration subfiles to be configured aiming at the function modules;
classifying the features in the data set to be processed according to the feature types;
determining corresponding configuration subfiles to be configured according to different types of features, acquiring configuration parameters of each configuration subfile, and generating a complete configuration file;
and processing the corresponding type of features based on the generated complete configuration file.
Optionally, in the above method embodiments of the present invention, the different processing types of the model features to which the plurality of functional modules correspond include missing value processing, normalization processing, binning processing and encoding processing.
Optionally, in the above method embodiments of the present invention, the configuration subfile to be configured contains configuration parameters for a developer to configure.
Optionally, in each of the method embodiments of the present invention, in the step of classifying the features in the data set to be processed according to feature types, the features in the data set to be processed are classified into numerical value type features and category type features, so as to generate a numerical value type feature list and a category type feature list.
Optionally, in the foregoing method embodiments of the present invention, in the step of determining, for different types of features, corresponding configuration subfiles to be configured, obtaining configuration parameters of each configuration subfile, and generating a complete configuration file, the configuration subfiles to be configured, which are required for performing feature processing, are determined for the numerical type features, the configuration parameters are obtained based on the determined configuration subfiles to be configured, and the configuration subfiles for which the configuration parameters are obtained are integrated, so as to obtain the complete configuration file for processing the numerical type features.
Optionally, in the foregoing method embodiments of the present invention, for the value type feature, the determined configuration subfiles to be configured include a configuration subfile to be configured for missing value processing, a configuration subfile to be configured for normalization processing, and a configuration subfile to be configured for binning processing.
Optionally, in the foregoing method embodiments of the present invention, in the step of determining, for different types of features, the corresponding configuration subfiles to be configured, obtaining the configuration parameters of each configuration subfile, and generating a complete configuration file, the configuration subfiles to be configured that are required for feature processing are determined for the category-type features, the configuration parameters are obtained based on the determined configuration subfiles to be configured, and the configuration subfiles whose configuration parameters have been obtained are integrated, so as to obtain the complete configuration file for processing the category-type features.
Optionally, in the foregoing method embodiments of the present invention, for the category-type features, the determined configuration subfiles to be configured include a configuration subfile to be configured for missing value processing and a configuration subfile to be configured for encoding processing.
Optionally, in each of the above method embodiments of the present invention, after processing the corresponding type of feature based on the generated complete configuration file, the method further includes:
and generating a processing image of the feature processing procedure in the training phase.
In order to achieve the above object, the present invention further provides a model feature rapid processing apparatus, including:
the configuration subfile generating module to be configured is used for determining a plurality of function modules corresponding to different processing types of the model characteristics and respectively generating the configuration subfiles to be configured aiming at the realization of each function module;
the characteristic classification module is used for classifying the characteristics in the data set to be processed according to the characteristic types;
the configuration file generation module is used for determining corresponding configuration subfiles to be configured according to different types of characteristics, acquiring configuration parameters of each configuration subfile and generating a complete configuration file;
and the feature processing module is used for processing the features of the corresponding types based on the configuration file generated by the configuration file generating module.
To achieve the above object, the present invention further provides a storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement the steps of the above model feature fast processing method.
In order to achieve the above object, the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the model feature fast processing method when executing the computer program.
Compared with the prior art, the model feature rapid processing method and device, storage medium and electronic equipment of the present invention perform feature processing by means of configuration files, so that when processing features for model training a developer only needs to pay attention to the processing steps rather than the processing logic and only needs to fill in the corresponding configuration files, which greatly shortens the feature processing flow. In addition, the invention automatically packages the model feature processing procedure into an image file, which avoids developing the processing code twice for training and for online prediction and ensures that the two processing stages remain consistent.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 is a flowchart illustrating a method for fast processing model features according to an exemplary embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a model feature rapid processing device according to an exemplary embodiment of the present invention; and
FIG. 3 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention.
Detailed Description
Hereinafter, example embodiments according to the present invention will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of embodiments of the invention and not all embodiments of the invention, with the understanding that the invention is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
It will be understood by those skilled in the art that the terms "first", "second", etc. in the embodiments of the present invention are used only for distinguishing different steps, devices or modules, etc., and do not denote any particular technical meaning or necessarily order therebetween.
It should also be understood that in embodiments of the present invention, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the invention may be generally understood as one or more, unless explicitly defined otherwise or stated to the contrary hereinafter.
In addition, the term "and/or" in the present invention is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In the present invention, the character "/" generally indicates that the preceding and following related objects are in an "or" relationship.
It should also be understood that the description of the embodiments of the present invention emphasizes the differences between the embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail, but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be discussed further in subsequent figures.
Embodiments of the invention are operational with numerous other general purpose or special purpose computing system environments or configurations, and with numerous other electronic devices, such as terminal devices, computer systems, servers, etc. Examples of well known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices, such as terminal devices, computer systems, servers, and the like, include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network personal computers, small computer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Exemplary method
Fig. 1 is a schematic flowchart of a method for rapidly processing model features according to an exemplary embodiment of the present invention. The embodiment can be applied to electronic equipment, and as shown in fig. 1, the method for rapidly processing the model features of the invention comprises the following steps:
step 101, determining a plurality of function modules corresponding to different processing types of model features, and generating a configuration subfile to be configured for implementation of each function module.
In the modeling process, the processing of model features mainly comprises missing value processing of the features, normalization processing and binning processing of the numerical-type features, and encoding processing of the category-type features. Therefore, in the embodiment of the invention, the processing of model features is divided into a plurality of functional modules corresponding to these different processing types, namely a missing value processing module, a normalization processing module, a binning processing module and an encoding processing module, and a corresponding configuration subfile is generated for each functional module so as to realize the corresponding function of that module.
The missing value processing module is configured to fill in missing values of the model features. Generally speaking, missing value filling needs to consider configuration parameters such as the filling mode, whether a specified value is used for filling, the filling mode for numerical-type features and the filling mode for category-type features. The filling mode includes two options, simple sampling and knn-algorithm sampling; the user configures one of the two as needed, and the missing value is filled according to the sampling result of the corresponding mode. For example, for feature A: 1, 2, empty, 4, 2, assuming the configured filling mode is simple sampling and the sampling result is 2, the missing value (empty) is filled with 2, so the final result of processing the missing values of feature A according to the configuration subfile is 1, 2, 2, 4, 2. The filling mode for numerical-type features can be configured as mean filling, median filling or frequency filling, and the filling mode for category-type features can be frequency filling. Therefore, for the missing value processing module, a configuration subfile to be configured is generated for these configuration parameters.
In the embodiment of the present invention, an example of a configuration subfile to be configured by a missing value processing module is as follows:
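The published application embeds the example as an image that is not reproduced in this text. A minimal illustrative sketch of what such a missing-value configuration subfile could contain is given below as a Python dictionary; every key name and value here is an assumption for illustration, not the parameter naming used in the application.

```python
# Sketch only: key names ("fill_mode", "numeric_fill", ...) are illustrative assumptions.
missing_value_config = {
    "fill_mode": "simple_sampling",    # or "knn_sampling"
    "use_specified_value": False,      # whether a fixed value is specified for filling
    "specified_value": None,
    "numeric_fill": "mean",            # "mean", "median" or "most_frequent"
    "category_fill": "most_frequent",  # category-type features are filled by frequency
}
```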
The normalization processing module is used for normalizing the numerical-type features. Generally, the configuration parameters of normalization processing mainly include the processing mode and the feature columns to be normalized, and the processing mode can be configured as maximum-absolute-value scaling, min-max scaling or standard scaling. Maximum-absolute-value scaling first obtains the maximum absolute value of each feature column and then divides each feature by that maximum, scaling the feature to the interval [-1, 1]. Min-max scaling first obtains the maximum and minimum values of each feature column, then subtracts the column minimum from each feature and divides by the difference between the maximum and the minimum, scaling the feature to the interval [0, 1]. Standard scaling centers the features on the mean and normalizes them, so that the processed features follow a standard normal distribution with a mean of 0 and a standard deviation of 1. The configuration parameter for the feature columns specifies which feature columns of the model features need to be normalized.
In the embodiment of the present invention, an example of the configuration subfile to be configured by the normalization processing module is as follows:
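As above, the example itself is not reproduced in this text; a sketch of a normalization configuration subfile, again with assumed key and column names, might be:

```python
# Sketch only: key names and column names are illustrative assumptions.
normalization_config = {
    "scaling_mode": "min_max",             # "max_abs", "min_max" or "standard"
    "feature_columns": ["age", "income"],  # numerical-type columns to be normalized
}
```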
the binning processing module is used for binning the numerical characteristics, and the binning processing refers to dividing a continuous value into a plurality of sections, and the value of each section is regarded as a classification, for example, the age can be divided into the following sections after binning processing: for juveniles, young, middle-aged and old people, the results after the box separation are selected to be subjected to 0-1 coding or numerical coding, and the width of each box separation can be selected to be consistent or the characteristic number of each box separation can be selected to be consistent by a box separation strategy. The configuration parameters for the binning processing include a feature column required to be binned, a method for coding the conversion result, and a strategy for defining the bin width, wherein the configuration parameters for the method for coding the conversion result can be configured as one of onehot, onehot-dense, and edit, the onehot method is to code the converted result by one-hot coding and return a sparse matrix, the ignored features are always superimposed to the right, and the onehot-dense method is to perform single hot coding on the converted result and return a dense array. Ignored features are always stacked on the right, the ordinal method refers to returning bin identifiers encoded as integers, the configuration parameters of the policy used to define the bin width can be configured as one of uniform, quantile, and kmeans, the uniform policy is that all bins in each feature have the same width, the quantile policy is that all bins in each feature have the same number of points, and the kmeans policy is that the value in each bin has the same nearest center of a one-dimensional k-means cluster.
In the embodiment of the present invention, an example of the configuration subfile to be configured by the binning processing module is as follows:
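Again, only a sketch can be given here. The option values onehot/onehot-dense/ordinal and uniform/quantile/kmeans come from the description; the key names, column names and bin count are illustrative assumptions:

```python
# Sketch only: key names, column names and the bin count are illustrative assumptions.
binning_config = {
    "feature_columns": ["age"],   # numerical-type columns to be binned
    "encode": "ordinal",          # "onehot", "onehot-dense" or "ordinal"
    "strategy": "quantile",       # "uniform", "quantile" or "kmeans"
    "n_bins": 4,                  # e.g. juvenile / young / middle-aged / elderly
}
```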
the encoding processing module is configured to perform encoding processing on the class-type features, so that the class-type features can participate in numerical calculation, and for configuration parameters of the encoding processing including a feature column to be encoded and an encoding mode, the configuration parameters of the encoding mode can be configured as 0-1 encoding and class encoding, where 0-1 encoding, that is, one-hot, refers to performing 0-1 encoding on the features, for example: the sex characteristics are as follows: male and female, if the sex characteristic of one of the data is male, its code is [1,0], the category code, i.e. ordinal, means that the characteristic is category-coded, the code is self-increment from 1, for example: the sex characteristics are as follows: male and female, if one of the data is gender specific, it is encoded to be 2.
In the embodiment of the present invention, an example of the configuration subfile to be configured by the encoding processing module is as follows:
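As with the other modules, the example is not reproduced here; a sketch of an encoding configuration subfile with assumed names:

```python
# Sketch only: key names and column names are illustrative assumptions.
encoding_config = {
    "feature_columns": ["gender"],  # category-type columns to be encoded
    "encode_mode": "onehot",        # "onehot" (0-1 encoding) or "ordinal" (category codes from 1)
}
```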
and 102, classifying the features in the data set to be processed according to the feature types.
In the embodiment of the present invention, the model features included in the data set include numerical type features and category features, and therefore, for a data set to be processed, initialization processing is performed first, the model features in the data set to be processed are classified according to feature types, the numerical type features and the category features in the data set to be processed are determined, a numerical type feature list is generated according to the determined numerical type features, and a category feature list is generated according to the determined category features.
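The application does not prescribe a concrete implementation of this initialization step; a minimal sketch, assuming the data set to be processed is held in a pandas DataFrame, could look like:

```python
import pandas as pd

def split_feature_types(df: pd.DataFrame):
    """Split the columns of the data set to be processed into a numerical-type
    feature list and a category-type feature list."""
    numeric_features = df.select_dtypes(include=["number"]).columns.tolist()
    category_features = df.select_dtypes(exclude=["number"]).columns.tolist()
    return numeric_features, category_features

# Example usage (hypothetical file name):
# numeric_list, category_list = split_feature_types(pd.read_csv("dataset.csv"))
```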
Step 103, determining corresponding configuration subfiles to be configured according to different types of features, acquiring configuration parameters of each configuration subfile, and generating a complete configuration file.
In the embodiment of the invention, for the numerical-type features in the numerical-type feature list, the configuration subfiles to be configured that are required for feature processing are determined, namely the configuration subfile to be configured for missing value processing, the configuration subfile to be configured for normalization processing and the configuration subfile to be configured for binning processing. Based on the determined configuration subfiles to be configured, developers configure the corresponding configuration parameters as needed; after the developers have configured the parameters, the configured parameters are obtained, and the configuration subfiles whose configuration parameters have been obtained are integrated to obtain a complete configuration file for processing the numerical-type features.
Similarly, for the category-type features in the category-type feature list, the configuration subfiles to be configured that are required for feature processing are determined, namely the configuration subfile to be configured for missing value processing and the configuration subfile to be configured for encoding processing. Based on the determined configuration subfiles to be configured, developers configure the corresponding configuration parameters as needed; after the developers have configured the parameters, the configured parameters are obtained, and the configuration subfiles whose configuration parameters have been obtained are integrated to obtain a complete configuration file for processing the category-type features.
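How the configured subfiles are integrated is not spelled out in the text; one simple way to picture the integration step, reusing the illustrative dictionaries sketched above (all names are assumptions), is:

```python
def build_complete_config(subfiles: dict) -> dict:
    """Integrate the configured subfiles into one complete configuration file.
    `subfiles` maps a processing type to its configured parameters."""
    return {"steps": list(subfiles), "modules": subfiles}

# Complete configuration file for the numerical-type features (illustrative):
numeric_complete_config = build_complete_config({
    "missing_value": missing_value_config,
    "normalization": normalization_config,
    "binning": binning_config,
})

# Complete configuration file for the category-type features (illustrative):
category_complete_config = build_complete_config({
    "missing_value": missing_value_config,
    "encoding": encoding_config,
})
```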
Step 104, processing the features of the corresponding types based on the configuration file generated in step 103.
After a complete configuration file has been generated, the configuration file only defines which features are to be processed and how they are to be processed; the generated complete configuration file and the corresponding features still need to be input into a processing system for feature processing. Specifically, for the features in the numerical-type feature list, the complete configuration file for processing numerical-type features is input into the processing system, and the features in the numerical-type feature list are processed one by one based on that configuration file, so as to achieve rapid feature processing; for the category-type features in the category-type feature list, the complete configuration file for processing category-type features is input into the processing system, and the features in the category-type feature list are processed one by one based on that configuration file, likewise achieving rapid feature processing.
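The application does not name the library behind the processing system; the option values it lists (onehot, onehot-dense, ordinal, uniform, quantile, kmeans) match the scikit-learn transformers, so the following sketch of how the processing system could interpret a complete configuration file for numerical-type features rests on that assumption:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, MaxAbsScaler, MinMaxScaler, StandardScaler

SCALERS = {"max_abs": MaxAbsScaler, "min_max": MinMaxScaler, "standard": StandardScaler}

def process_numeric_features(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Apply the modules listed in the complete configuration file, column by column."""
    modules = config["modules"]
    if "missing_value" in modules:
        imputer = SimpleImputer(strategy=modules["missing_value"]["numeric_fill"])
        df[df.columns] = imputer.fit_transform(df)
    if "normalization" in modules:
        cols = modules["normalization"]["feature_columns"]
        scaler = SCALERS[modules["normalization"]["scaling_mode"]]()
        df[cols] = scaler.fit_transform(df[cols])
    if "binning" in modules:
        b = modules["binning"]
        # Simplified: ordinal encoding keeps the output shape equal to the input columns;
        # the onehot variants would add columns and are omitted here for brevity.
        binner = KBinsDiscretizer(n_bins=b["n_bins"], encode="ordinal", strategy=b["strategy"])
        df[b["feature_columns"]] = binner.fit_transform(df[b["feature_columns"]])
    return df
```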
Optionally, in an embodiment of the present invention, after step 104, the model feature rapid processing method further includes:
Step 105, generating a processing image of the feature processing procedure of step 104 in the training stage.
Specifically, in the feature processing procedure of step 104, the system state in the process of inputting the configuration file into the processing system and performing feature processing based on the configuration file to obtain the feature processing result may be packaged into an image file.
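The application does not specify how the image file is produced; one plausible realization, sketched with the Docker SDK for Python purely as an assumption (any container or imaging tooling could serve), packages the configuration file and the fitted processing artifacts saved in a build directory into an image:

```python
import docker  # assumption: the application does not name a specific imaging tool

def package_processing_image(build_context: str, tag: str):
    """Package the feature processing state (complete configuration file plus the
    artifacts fitted during training, saved under `build_context` together with a
    Dockerfile) into an image that can be reused for online prediction."""
    client = docker.from_env()
    image, _build_logs = client.images.build(path=build_context, tag=tag, rm=True)
    return image
```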
Generally, feature processing involves two stages, training and prediction, and both take raw data as input; for the two stages to produce the same results, their processing modes must be consistent. For example, when the missing values of a feature are filled with the most frequent value, the training stage has a large amount of data from which the most frequent value can be determined, whereas the prediction stage may have only a single piece of data, so the prediction stage must depend on the statistics of the training process.
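To illustrate why prediction has to reuse the statistics learned during training, here is a small sketch assuming scikit-learn and joblib are used for fitting and persistence (the column name and data are hypothetical):

```python
import joblib
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Training stage: the most frequent value is learned from the full training data.
train_df = pd.DataFrame({"city": ["beijing", "shanghai", "beijing", np.nan]})  # hypothetical data
imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(train_df[["city"]])
joblib.dump(imputer, "imputer.joblib")   # persisted, e.g. inside the processing image

# Prediction stage: a single record cannot define "most frequent", so the fitted
# imputer from training is loaded and reused, keeping both stages consistent.
predict_df = pd.DataFrame({"city": [np.nan]})
imputer = joblib.load("imputer.joblib")
filled = imputer.transform(predict_df[["city"]])  # -> [["beijing"]]
```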
Exemplary devices
Fig. 2 is a schematic structural diagram of a model feature fast processing apparatus according to an exemplary embodiment of the present invention. As shown in fig. 2, the apparatus for fast processing model features of this embodiment includes:
a to-be-configured configuration subfile generating module 201, configured to determine a plurality of function modules corresponding to different processing types of the model features, and generate to-be-configured configuration subfiles for implementation of each function module.
In the modeling process, the processing of model features mainly comprises missing value processing of the features, normalization processing and binning processing of the numerical-type features, and encoding processing of the category-type features. Therefore, in the embodiment of the invention, the processing of model features is divided into a plurality of functional modules corresponding to these different processing types, namely a missing value processing module, a normalization processing module, a binning processing module and an encoding processing module, and corresponding configuration subfiles are generated for the functional modules respectively so as to realize their corresponding functions.
The missing value processing module is configured to fill in missing values of the model features. Generally speaking, missing value filling needs to consider configuration parameters such as the filling mode, whether a specified value is used for filling, the filling mode for numerical-type features and the filling mode for category-type features. The filling mode includes two options, simple sampling and knn-algorithm sampling; the user configures one of the two as needed, and the missing value is filled according to the sampling result of the corresponding mode. For example, for feature A: 1, 2, null, 4, 2, assuming the configured filling mode is simple sampling and the sampling result is 2, the missing value (null) is filled with 2, so the final result of processing the missing values of feature A according to the configuration subfile is 1, 2, 2, 4, 2. The filling mode for numerical-type features can be configured as mean filling, median filling or frequency filling, and the filling mode for category-type features can be frequency filling. Therefore, for the missing value processing module, a configuration subfile to be configured is generated according to these configuration parameters.
The normalization processing module is used for normalizing the numerical-type features. Generally, the configuration parameters of normalization processing mainly include the processing mode and the feature columns to be normalized, and the processing mode can be configured as maximum-absolute-value scaling, min-max scaling or standard scaling. Maximum-absolute-value scaling first obtains the maximum absolute value of each feature column and then divides each feature by that maximum, scaling the feature to the interval [-1, 1]. Min-max scaling first obtains the maximum and minimum values of each feature column, then subtracts the column minimum from each feature and divides by the difference between the maximum and the minimum, scaling the feature to the interval [0, 1]. Standard scaling centers the features on the mean and normalizes them, so that the processed features follow a standard normal distribution with a mean of 0 and a standard deviation of 1. The configuration parameter for the feature columns specifies which feature columns of the model features need to be normalized.
The binning processing module is used for binning the numerical-type features. Binning refers to dividing a continuous value range into several segments, with the value of each segment treated as one category; for example, after binning, age can be divided into teenagers, young adults, middle-aged adults and the elderly. The binning result can be given 0-1 encoding or numerical encoding, and the binning strategy can make either the width of each bin or the number of features in each bin consistent. The configuration parameters for binning include the feature columns to be binned, the method used to encode the conversion result, and the strategy used to define the bin width. The encoding method can be configured as one of onehot, onehot-dense and ordinal: the onehot method encodes the converted result with one-hot encoding and returns a sparse matrix, with ignored features always stacked to the right; the onehot-dense method one-hot encodes the converted result and returns a dense array, with ignored features always stacked to the right; and the ordinal method returns bin identifiers encoded as integers. The strategy used to define the bin width can be configured as one of uniform, quantile and kmeans: the uniform strategy gives all bins in each feature the same width, the quantile strategy gives all bins in each feature the same number of points, and the kmeans strategy makes the values in each bin share the same nearest center of a one-dimensional k-means cluster.
The encoding processing module is configured to encode the category-type features so that they can participate in numerical calculation. The configuration parameters for encoding include the feature columns to be encoded and the encoding mode, and the encoding mode can be configured as 0-1 encoding or category encoding. 0-1 encoding refers to one-hot encoding the feature; for example, for a gender feature with the values male and female, if the gender feature of a piece of data is male, its code is [1, 0]. Category encoding refers to encoding the categories as integers that increase from 1; for the same gender feature, one value is encoded as 1 and the other as 2.
The feature classification module 202 classifies features in the dataset to be processed according to feature types.
In the embodiment of the present invention, the model features included in the data set include numerical type features and category features, and therefore, for a data set to be processed, initialization processing is performed first, the model features in the data set to be processed are classified according to feature types, the numerical type features and the category features in the data set to be processed are determined, a numerical type feature list is generated according to the determined numerical type features, and a category feature list is generated according to the determined category features.
The configuration file generating module 203 is configured to determine corresponding configuration subfiles to be configured for different types of features, obtain configuration parameters of each configuration subfile, and generate a complete configuration file.
In the embodiment of the invention, for the numerical-type features in the numerical-type feature list, the configuration subfiles to be configured that are required for feature processing are determined, namely the configuration subfile to be configured for missing value processing, the configuration subfile to be configured for normalization processing and the configuration subfile to be configured for binning processing. Based on the determined configuration subfiles to be configured, developers configure the corresponding configuration parameters as needed; after the developers have configured the parameters, the configured parameters are obtained, and the configuration subfiles whose configuration parameters have been obtained are integrated to obtain a complete configuration file for processing the numerical-type features.
Similarly, for the category-type features in the category-type feature list, the configuration subfiles to be configured that are required for feature processing are determined, namely the configuration subfile to be configured for missing value processing and the configuration subfile to be configured for encoding processing. Based on the determined configuration subfiles to be configured, developers configure the corresponding configuration parameters as needed; after the developers have configured the parameters, the configured parameters are obtained, and the configuration subfiles whose configuration parameters have been obtained are integrated to obtain a complete configuration file for processing the category-type features.
The feature processing module 204 is used for processing the features of the corresponding types based on the configuration file generated by the configuration file generating module 203.
After a complete configuration file has been generated, the configuration file only defines which features are to be processed and how they are to be processed; the generated complete configuration file and the corresponding features still need to be input into a processing system for feature processing. Specifically, for the features in the numerical-type feature list, the complete configuration file for processing numerical-type features is input into the processing system, and the features in the numerical-type feature list are processed one by one based on that configuration file, so as to achieve rapid feature processing; for the category-type features in the category-type feature list, the complete configuration file for processing category-type features is input into the processing system, and the features in the category-type feature list are processed one by one based on that configuration file, likewise achieving rapid feature processing.
Optionally, in an embodiment of the present invention, the model feature rapid processing device further includes:
an image file generation module 205, configured to generate a processing image of the feature processing procedure performed by the feature processing module 204 in the training stage.
Specifically, during the feature processing performed by the feature processing module 204 in the training stage, the system state during the process of inputting the configuration file into the processing system and performing feature processing based on the configuration file to obtain the feature processing result may be packaged into an image file.
Generally, feature processing involves two stages, training and prediction, and both take raw data as input; for the two stages to produce the same results, their processing modes must be consistent. For example, when the missing values of a feature are filled with the most frequent value, the training stage has a large amount of data from which the most frequent value can be determined, whereas the prediction stage may have only a single piece of data, so the prediction stage must depend on the statistics of the training process.
Exemplary electronic device
Fig. 3 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present invention. The electronic device may be the first device, the second device, or both, or a stand-alone device separate from them that can communicate with the first device and the second device to receive the acquired input signals from them. As shown in Fig. 3, the electronic device includes one or more processors 31 and a memory 32.
The processor 31 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
The input device 33 may also include, for example, a keyboard, a mouse, and the like.
The output device 34 can output various information to the outside. The output devices 34 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, among others.
Of course, for simplicity, only some of the components of the electronic device relevant to the present disclosure are shown in fig. 3, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device may include any other suitable components, depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the model feature rapid processing method according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may write program code for carrying out operations for embodiments of the present disclosure in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the model feature rapid processing method according to various embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the disclosure is not intended to be limited to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements or configurations must be made in the manner shown in the block diagrams. These devices, apparatuses and systems may be connected, arranged or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising" and "having" are open-ended words that mean "including, but not limited to" and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the word "and/or", unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (12)
1. A model feature rapid processing method comprises the following steps:
determining a plurality of function modules corresponding to different processing types of model characteristics, and respectively generating configuration subfiles to be configured aiming at the function modules;
classifying the features in the data set to be processed according to the feature types;
determining corresponding configuration subfiles to be configured according to different types of features, acquiring configuration parameters of each configuration subfile, and generating a complete configuration file;
and processing the corresponding type of features based on the generated complete configuration file.
2. The method for rapidly processing model features of claim 1, wherein: the determining of the functional modules of the plurality of functional modules corresponding to different processing types of the model features comprises missing value processing, normalization processing, binning processing and encoding processing.
3. The method for rapidly processing model features as claimed in claim 2, wherein: the configuration subfile to be configured has configuration parameters therein to provide developer configuration.
4. The method for rapidly processing model features of claim 1, wherein: in the step of classifying the features in the data set to be processed according to the feature types, the features in the data set to be processed are classified into numerical value type features and category type features, and a numerical value type feature list and a category type feature list are generated.
5. The method for rapidly processing model features as claimed in claim 2, wherein: in the step of determining corresponding configuration subfiles to be configured aiming at different types of features, acquiring configuration parameters of each configuration subfile and generating a complete configuration file, determining the configuration subfiles to be configured required for feature processing for the numerical value type features, acquiring the configuration parameters based on the determined configuration subfiles to be configured, and integrating the configuration subfiles after the configuration parameters are acquired to obtain the complete configuration file for processing the numerical value type features.
6. The method for rapidly processing model features of claim 5, wherein: for the numerical value type feature, the determined configuration subfiles to be configured comprise a configuration subfile to be configured for missing value processing, a configuration subfile to be configured for standardization processing and a configuration subfile to be configured for box separation processing.
7. The method for rapidly processing model features as claimed in claim 2, wherein: in the step of determining corresponding configuration subfiles to be configured aiming at different types of features, acquiring configuration parameters of each configuration subfile and generating a complete configuration file, determining the configuration subfiles to be configured required for feature processing for the type features, acquiring the configuration parameters based on the determined configuration subfiles to be configured, and integrating the configuration subfiles after the configuration parameters are acquired to obtain the complete configuration file for processing the type features.
8. The method for rapidly processing model features of claim 7, wherein: for the class type characteristics, the determined configuration subfiles to be configured comprise the configuration subfiles to be configured for missing value processing and the configuration subfiles to be configured for encoding processing.
9. The method for rapidly processing model features as claimed in claim 1, further comprising, after processing the features of the corresponding type based on the generated complete configuration file:
and generating a processing mirror image of the characteristic processing process in the training phase.
10. A model feature rapid processing apparatus, comprising:
a to-be-configured subfile generating module, configured to determine a plurality of functional modules corresponding to different processing types of model features and to generate a configuration subfile to be configured for the implementation of each functional module;
a feature classification module, configured to classify the features in a data set to be processed according to feature type;
a configuration file generating module, configured to determine the corresponding configuration subfiles to be configured for different types of features, acquire the configuration parameters of each configuration subfile and generate a complete configuration file; and
a feature processing module, configured to process the features of the corresponding types based on the configuration file generated by the configuration file generating module.
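Purely as an illustration of how the four modules recited in claim 10 could be wired together, the sketch below models the apparatus as a plain Python class; the class, method and key names are hypothetical and not part of the patent.

```python
# Hypothetical wiring of the four modules in claim 10; an illustration of
# the division of responsibilities, not the patented implementation.
import pandas as pd

class ModelFeatureProcessor:
    def generate_subfiles(self) -> dict:
        # To-be-configured subfile generating module: one subfile per functional module.
        return {"missing_value": {"strategy": "median"},
                "normalization": {"method": "zscore"},
                "binning": {"n_bins": 5},
                "encoding": {"method": "one_hot"}}

    def classify(self, df: pd.DataFrame) -> tuple[list[str], list[str]]:
        # Feature classification module: split columns by dtype.
        numerical = df.select_dtypes(include=["number"]).columns.tolist()
        categorical = df.select_dtypes(exclude=["number"]).columns.tolist()
        return numerical, categorical

    def build_config(self, subfiles: dict, feature_type: str) -> dict:
        # Configuration file generation module: pick and merge the relevant subfiles.
        steps = (["missing_value", "normalization", "binning"]
                 if feature_type == "numerical" else ["missing_value", "encoding"])
        return {"feature_type": feature_type,
                "steps": {name: subfiles[name] for name in steps}}

    def process(self, df: pd.DataFrame, config: dict) -> pd.DataFrame:
        # Feature processing module (only the missing value step, for brevity).
        out = df.copy()
        if "missing_value" in config["steps"]:
            out = out.fillna(out.median(numeric_only=True))
        return out
```

In such a design the apparatus only orchestrates; the actual transformation behaviour lives in the per-module configuration parameters, which is what would let a developer change processing by editing configuration rather than code.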
11. A storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the model feature rapid processing method according to any one of claims 1 to 9.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the model feature rapid processing method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210425982.3A CN114816506B (en) | 2022-04-21 | 2022-04-21 | Quick processing method and device for model features, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114816506A true CN114816506A (en) | 2022-07-29 |
CN114816506B CN114816506B (en) | 2024-08-09 |
Family
ID=82506455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210425982.3A Active CN114816506B (en) | 2022-04-21 | 2022-04-21 | Quick processing method and device for model features, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114816506B (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108733639A (en) * | 2018-04-09 | 2018-11-02 | 中国平安人寿保险股份有限公司 | A kind of configuration parameter regulation means, device, terminal device and storage medium |
CN108764273A (en) * | 2018-04-09 | 2018-11-06 | 中国平安人寿保险股份有限公司 | A kind of method, apparatus of data processing, terminal device and storage medium |
CN109460396A (en) * | 2018-10-12 | 2019-03-12 | 中国平安人寿保险股份有限公司 | Model treatment method and device, storage medium and electronic equipment |
CN111382347A (en) * | 2018-12-28 | 2020-07-07 | 广州市百果园信息技术有限公司 | Object feature processing and information pushing method, device and equipment |
US20200285649A1 (en) * | 2019-03-04 | 2020-09-10 | Walmart Apollo, Llc | Systems and methods for a machine learning framework |
CN112487180A (en) * | 2019-09-12 | 2021-03-12 | 北京地平线机器人技术研发有限公司 | Text classification method and device, computer-readable storage medium and electronic equipment |
US20210232915A1 (en) * | 2020-01-23 | 2021-07-29 | UMNAI Limited | Explainable neural net architecture for multidimensional data |
US20220101178A1 (en) * | 2020-09-25 | 2022-03-31 | EMC IP Holding Company LLC | Adaptive distributed learning model optimization for performance prediction under data privacy constraints |
CN112394942A (en) * | 2020-11-24 | 2021-02-23 | 季明 | Distributed software development compiling method and software development platform based on cloud computing |
CN113094116A (en) * | 2021-04-01 | 2021-07-09 | 中国科学院软件研究所 | Deep learning application cloud configuration recommendation method and system based on load characteristic analysis |
Non-Patent Citations (7)
Title |
---|
C. SHANTHI et al.: "An artificial intelligence based improved classification of two-phase flow patterns with feature extracted from acquired images", ISA Transactions, vol. 68, 31 May 2017 (2017-05-31), pages 425 - 432, XP029987566, DOI: 10.1016/j.isatra.2016.10.021 *
CHLOEZHAO: "Feature engineering: feature classification and processing methods for different feature types" (in Chinese), pages 1 - 2, Retrieved from the Internet <URL:https://blog.csdn.net/chloezhao/article/details/53444856> *
MANTCH: "The evolution of BERT pre-trained models (with code)" (in Chinese), pages 1 - 12, Retrieved from the Internet <URL:https://www.cnblogs.com/mantch/p/11605111.html> *
刘坚 (LIU Jian): "Design and implementation of an ARM-based embedded face recognition system" (in Chinese), China Masters' Theses Full-text Database (Information Science and Technology), no. 01, 15 January 2018 (2018-01-15), pages 138 - 1055 *
张猛 (ZHANG Meng): "Design of a face recognition system based on a network camera" (in Chinese), China Masters' Theses Full-text Database (Information Science and Technology), no. 05, 15 May 2021 (2021-05-15), pages 138 - 1025 *
白媛媛 (BAI Yuanyuan): "Design and implementation of a vulnerability detection model for VMware" (in Chinese), China Masters' Theses Full-text Database (Information Science and Technology), no. 07, 15 July 2016 (2016-07-15), pages 138 - 30 *
胡雪晴 (HU Xueqing): "Design and implementation of UI human-computer interaction methods for automated testing of Android applications" (in Chinese), China Masters' Theses Full-text Database (Information Science and Technology), no. 01, 15 January 2022 (2022-01-15), pages 138 - 414 *
Also Published As
Publication number | Publication date |
---|---|
CN114816506B (en) | 2024-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11741361B2 (en) | Machine learning-based network model building method and apparatus | |
CN109101537B (en) | Multi-turn dialogue data classification method and device based on deep learning and electronic equipment | |
CN115994177B (en) | Intellectual property management method and system based on data lake | |
US20090112752A1 (en) | Distributed Scoring of Data Transactions | |
CN112883154B (en) | Text topic mining method and device, computer equipment and storage medium | |
CN112364937A (en) | User category determination method and device, recommended content determination method and electronic equipment | |
US20230004979A1 (en) | Abnormal behavior detection method and apparatus, electronic device, and computer-readable storage medium | |
CN110647832A (en) | Method and device for acquiring information in certificate, electronic equipment and storage medium | |
CN112612887A (en) | Log processing method, device, equipment and storage medium | |
CN115526320A (en) | Neural network model inference acceleration method, apparatus, electronic device and medium | |
CN110264311B (en) | Business promotion information accurate recommendation method and system based on deep learning | |
US20220172065A1 (en) | Automatic ontology generation by embedding representations | |
CN114398396A (en) | Data query method, storage medium, and computer program product | |
CN117234369B (en) | Digital human interaction method and system, computer readable storage medium and digital human equipment | |
CN114816506A (en) | Model feature rapid processing method and device, storage medium and electronic equipment | |
CN117390473A (en) | Object processing method and device | |
CN116128250A (en) | Electricity selling deviation management method and system for spot market | |
Romero-Gainza et al. | Memory mapping and parallelizing random forests for speed and cache efficiency | |
CN113536252B (en) | Account identification method and computer-readable storage medium | |
CN114726870A (en) | Hybrid cloud resource arrangement method and system based on visual dragging and electronic equipment | |
CN112801226A (en) | Data screening method and device, computer readable storage medium and electronic equipment | |
EP3460677A1 (en) | Assessment program, assessment device, and assessment method | |
CN111639714A (en) | Method, device and equipment for determining attributes of users | |
US11657216B2 (en) | Input text management | |
US20240311413A1 (en) | Determining input sentiment by automatically processing text data and non-text data using artificial intelligence techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||