CN114490809A

CN114490809A - Data mining method and device based on feature transformation function and readable medium

Info

Publication number: CN114490809A
Application number: CN202111640805.9A
Authority: CN
Inventors: 张菲兰
Original assignee: Sichuan XW Bank Co Ltd
Current assignee: Sichuan XW Bank Co Ltd
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-05-13

Abstract

The invention relates to a data mining method, device, and readable medium based on a feature transformation function, and belongs to the technical field of computers. The method comprises the following steps. (1) Raw data extraction: extract the raw data required for feature mining, and apply two passes of lossless information compression to the complex multidimensional raw data to form the data structure required by the feature transformation function. (2) Format conversion: construct a format conversion function that converts the Y × M data table generated in step (1) into the list and dict formats that can be processed in a Python environment, for use as the input to feature generation. (3) Feature generation: construct a feature transformation function that converts the input raw data in this specific format into the final features. The invention thus establishes a data mining method structured around a feature transformation function, and uses that function to build a more advanced and efficient method of mining feature data.

Description

Data mining method and device based on feature transformation function and readable medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data mining method, device, and readable medium based on a feature transformation function.

Background

With the rapid development of machine learning, the technology has been widely applied to data analysis across many industries and has brought great convenience. In the prior art, feature mining queries the raw data directly with SQL statements and computes the features using SQL's own computational capabilities, so that query and computation end up coupled together. In addition, owing to the computational and resource limitations of SQL, feature mining generates a large number of useless temporary intermediate tables, which occupy resources to no purpose.

Disclosure of Invention

In order to solve the above technical problems in the prior art, the invention provides a data mining method, device, and readable medium based on a feature transformation function, and uses the feature transformation function to build a more advanced and efficient method of mining feature data.

The technical scheme adopted by the invention is as follows:

The data mining method based on the feature transformation function comprises the following steps:

Step 1: extract the raw data required for feature mining from a database by means of SQL statements, compress the multidimensional raw data to form the data structure required by the feature transformation function, generate a data table structure, and store the generated data table structure in a newly created storage space in the memory;

Step 2: retrieve the data table structure stored in the memory and convert it into the list and dict data formats of the Python environment, as the input to feature generation;

Step 3: construct a feature transformation function, and use it to convert the data converted into the list and dict data formats in step 2 into the final features.
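As an illustration of the format conversion in step 2, the following is a minimal sketch rather than the patented implementation. It assumes that each cell of the stored table holds a list of concatenated strings for one user, that fields are separated by "|", and that a hypothetical `SCHEMA` mapping gives the field names of each source table. None of these details are specified in the source.

```python
SEP = "|"  # assumed field separator; the source does not name one

# Hypothetical field layout for each underlying table (illustrative names only).
SCHEMA = {"loans": ("amount", "date")}


def to_python_records(cell, fields):
    """Split each concatenated string and pair the parts with field names."""
    return [dict(zip(fields, item.split(SEP))) for item in cell]


def convert_row(row):
    """Convert one user's row (one cell per source table) into lists of dicts."""
    return {name: to_python_records(cell, SCHEMA[name]) for name, cell in row.items()}


# One user's row as it might come back from the stored data table.
row = {"loans": ["100.0|2021-01-05", "250.5|2021-02-11"]}
converted = convert_row(row)
```

Values stay as strings at this point; type conversion is left to the preprocessing in step 3.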

Because the method uses Python as its main computational tool, the potential for mining complex features in the future is greatly improved. The new data structure clearly stores the raw data as it was before feature generation and allows it to be traced back, so that features can be quickly modified and traced, and the feature processing can be debugged and tested. In addition, after the above steps the features are modular and configurable; once optimized, they form a reusable data module structure that is flexible and clear, has high fault tolerance, and effectively shortens the data mining development cycle.

Preferably, step 1 specifically comprises the following steps:

Step 1.1: extract all fields required for feature processing from the original underlying table of the database, fix each field in string or numeric format using the concat function, and compress the fields into 1 record, forming a Y × N dimensional data table;

Step 1.2: aggregate all data in the Y × N dimensional data table into user-dimension data using GROUP BY and the collect_list function, compressing the N records into 1 record and forming a Y × 1 data table;

Step 1.3: define M such Y × 1 data tables and merge them, finally forming a Y × M raw data table.

The database query of the invention involves only the extraction of the original fields and contains no computation. Compared with the prior art, this eliminates most of the steps that generate intermediate temporary tables and optimizes the allocation of database resources.

Preferably, step 3 specifically comprises the following steps:

Step 3.1: preprocess the data converted in step 2 into the list and dict data formats of the Python environment;

Step 3.2: define an initial default value for each feature, and then compute the value of each feature according to the mining logic.

Preferably, step 3.1 includes performing a feature transformation on the raw data and adding derived data fields.
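Steps 3.1 and 3.2 can be sketched as follows. This is an illustrative example under stated assumptions, not the inventor's code: the field names `amount` and `date`, the derived `month` field, the feature names, and the default values are all hypothetical.

```python
from datetime import date


def preprocess(records):
    """Step 3.1: convert string fields to typed values and add a derived field."""
    out = []
    for r in records:
        d = date.fromisoformat(r["date"])
        # "month" is a derived field computed from the original date.
        out.append({"amount": float(r["amount"]), "date": d, "month": d.month})
    return out


def loan_features(records):
    """Step 3.2: start from default values, then overwrite them when data exists."""
    features = {"loan_count": 0, "total_amount": 0.0, "last_month": -1}
    rows = preprocess(records)
    if rows:
        features["loan_count"] = len(rows)
        features["total_amount"] = sum(r["amount"] for r in rows)
        features["last_month"] = max(rows, key=lambda r: r["date"])["month"]
    return features


records = [
    {"amount": "100.0", "date": "2021-01-05"},
    {"amount": "250.5", "date": "2021-02-11"},
]
result = loan_features(records)
empty_result = loan_features([])  # a user with no records keeps the defaults
```

Defining the defaults first means a user with no records still receives a complete feature set, which keeps the output schema identical for every user.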

Preferably, the compression in step 1 consists of applying lossless information compression to the multidimensional raw data twice.

A computer device comprising a processor, a memory, and a data mining program stored on the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements any of the above data mining methods based on a feature transformation function.

A computer-readable storage medium having a data mining program stored thereon, wherein the data mining program, when executed by a processor, implements the steps of any of the above data mining methods based on a feature transformation function.

The beneficial effects of the invention include:

1. The method uses Python, a more advanced tool, as its main computational tool, which greatly improves the potential for mining complex features. When generating the same features, the running speed is 5 to 6 times higher, which effectively improves operating efficiency.

2. The invention forms a reusable data module structure with higher fault tolerance, effectively shortening the data mining development cycle. The new data module structure clearly stores the raw data as it was before feature generation and allows it to be traced back, so that features can be quickly modified and the feature processing can be debugged or tested.

Drawings

Description of reference numerals:

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of embodiments of the present application, generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

The invention is described in further detail below with reference to FIG. 1:

Step 1: extract the raw data required for feature mining from a database by means of SQL statements, apply two passes of lossless information compression to the multidimensional raw data to form the data structure required by the feature transformation function, generate a data table structure, and store the generated data table structure in a newly created storage space in the memory;

Step 1 specifically comprises the following steps:

Step 1.1: extract all fields required for feature processing from the original underlying table of the database; given Y users, each with N records, fix each field in string or numeric format using the concat function and compress the fields into 1 record, forming a Y × N dimensional data table;

Step 1.2: aggregate all data in the Y × N dimensional data table into user-dimension data using GROUP BY and the collect_list function, compressing the N records into 1 record and forming a Y × 1 data table;

Step 1.3: define M such Y × 1 data tables and merge them, finally forming a Y × M raw data table. Because each data module may draw on several underlying tables, the invention defines each data module as having M underlying tables, where M is any positive integer from 1 to n.

Step 3: construct a feature transformation function, and use it to convert the data converted into the list and dict data formats in step 2 into the final features.

The step 3 specifically comprises the following steps:

Step 3.1: preprocess the data converted in step 2 into the list and dict data formats of the Python environment, for example by converting formats and adding derived data fields.

The calculation of the value of each feature in step 3.2 includes, but is not limited to, computing various statistical values, such as the mean, standard deviation, maximum, and minimum.
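The statistical features named above can be computed with Python's standard library. The sketch below is one possible form: the default value of -1.0 and the use of the sample standard deviation (`statistics.stdev`) are assumptions, since the source specifies neither.

```python
import statistics


def stat_features(values, default=-1.0):
    """Return mean, std, max, and min of `values`, falling back to `default`."""
    features = {name: default for name in ("mean", "std", "max", "min")}
    if values:
        features["mean"] = statistics.mean(values)
        features["max"] = max(values)
        features["min"] = min(values)
    if len(values) >= 2:
        # Sample standard deviation needs at least two data points.
        features["std"] = statistics.stdev(values)
    return features


amount_stats = stat_features([1.0, 2.0, 3.0])
no_data_stats = stat_features([])
```

With a single value, the mean, maximum, and minimum are filled in while the standard deviation keeps its default.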

The above embodiments merely illustrate specific implementations of the present application. Although they are described in specific detail, they are not to be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make several changes and modifications without departing from the technical idea of the present application, all of which fall within the protection scope of the present application.

Claims

1. A data mining method based on a feature transformation function, characterized by comprising the following steps:

2. The feature transformation function-based data mining method according to claim 1, wherein the step 1 specifically comprises the steps of:

Step 1.1: extract all fields required for feature processing from the original underlying table; given Y users, each with N records, fix each field in string or numeric format using the concat function and compress the fields into 1 record, forming a Y × N dimensional data table;

Step 1.2: aggregate all data in the Y × N dimensional data table into user-dimension data using GROUP BY and the collect_list function, compressing the N records into 1 record and forming a Y × 1 data table;

3. The feature transformation function-based data mining method according to claim 1, wherein the step 3 specifically comprises the steps of:

4. The feature transformation function-based data mining method according to claim 3, wherein step 3.1 comprises performing a feature transformation on the raw data and adding derived data fields.

5. The feature transformation function-based data mining method according to claim 1, wherein the compression process in step 1 is: the multidimensional original data is subjected to lossless information compression twice.

6. A computer device comprising a processor, a memory, and a data mining program stored on the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements the steps of the data mining method of any of claims 1 to 5.

7. A computer-readable storage medium, having a data mining program stored thereon, wherein the program, when executed by a processor, implements the steps of the data mining method of any one of claims 1 to 5.