CN114490809A - Data mining method and device based on feature transformation function and readable medium - Google Patents
- Publication number
- CN114490809A
- Authority
- CN
- China
- Prior art keywords
- data
- feature
- mining
- data mining
- function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
- G06F16/244—Grouping and aggregation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24558—Binary matching operations
- G06F16/2456—Join operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a data mining method, device, and readable medium based on a feature transformation function, and belongs to the field of computer technology. The method comprises the following steps. (1) Raw data extraction: the raw data required for feature mining is extracted, and the complex multidimensional raw data undergoes two rounds of lossless information compression to form the data structure required by the feature transformation function. (2) Format conversion: a format conversion function is constructed to convert the Y × M data table generated in step (1) into the list and dict formats that can be processed in a Python environment, which serve as the input for feature generation. (3) Feature generation: a feature transformation function is constructed to convert the input raw data in that format into the final features. The invention thereby establishes a data mining method built around a feature transformation function and uses that function to provide a more advanced and efficient approach to feature mining.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data mining method, device, and readable medium based on a feature transformation function.
Background
With the rapid development of machine learning, the technology has been widely applied to data analysis in many industries and has brought great convenience. In the prior art, feature mining queries the raw data directly with SQL statements and computes the features using SQL's own computational capabilities, so that querying and computation end up coupled together. In addition, because of the limits on SQL's operations and resources, a large number of useless temporary intermediate tables are generated during feature mining, which occupies resources to no purpose.
Disclosure of Invention
To solve these technical problems in the prior art, the invention provides a data mining method, device, and readable medium based on a feature transformation function, and uses the feature transformation function to build a more advanced and efficient feature mining method.
The technical solution adopted by the invention is as follows:
The data mining method based on the feature transformation function comprises the following steps:
Step 1: extracting, through SQL statements, the raw data required for feature mining from a database; compressing the multidimensional raw data to form the data structure required by the feature transformation function; generating a data table structure; and storing the generated data table structure in a newly created storage space in the memory;
Step 2: retrieving the data table structure stored in the memory and converting it into the list and dict data formats of a Python environment, as the input for feature generation;
Step 3: constructing a feature transformation function and using it to convert the data that was converted into the list and dict data formats in step 2 into the final features.
The method uses Python as its main computation tool, which greatly increases the potential for mining complex features in the future. The new data structure stores the raw data as it was before feature generation in a clear and traceable form, so features can be modified and traced back quickly and the feature processing can be debugged and tested. In addition, once the above steps are carried out, the features become modular and configurable; after optimization they form a reusable data module structure that is flexible, clear, and highly fault-tolerant, which effectively shortens the data mining development cycle.
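As an illustration of the format conversion in step 2, the following is a minimal Python sketch rather than the patented implementation. It assumes that each cell of the stored data table holds a list of delimiter-joined record strings; the delimiter `FIELD_SEP`, the table names, and the field names in `SCHEMAS` are hypothetical and do not come from the source.

```python
from typing import Dict, List

# Hypothetical delimiter and column layouts; the source does not specify them.
FIELD_SEP = "|"
SCHEMAS: Dict[str, List[str]] = {
    "orders": ["order_date", "amount"],
    "logins": ["login_date", "channel"],
}


def convert_row(row: Dict[str, List[str]]) -> Dict[str, List[Dict[str, str]]]:
    """Turn one user's row of the data table into a dict of lists of records."""
    converted = {}
    for table, packed_records in row.items():
        names = SCHEMAS[table]
        # Split each packed string back into named fields.
        converted[table] = [
            dict(zip(names, packed.split(FIELD_SEP))) for packed in packed_records
        ]
    return converted


row = {
    "orders": ["2021-12-01|120.5", "2021-12-03|80.0"],
    "logins": ["2021-12-01|app"],
}
user_input = convert_row(row)
```

Here `user_input` is a dict keyed by table name whose values are lists of per-record dicts, which is one plausible reading of the "list and dict" input that the feature transformation function consumes.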
Preferably, step 1 specifically comprises the following steps:
Step 1.1: extracting all fields required for feature processing from the original underlying tables of the database, fixing each field in a string or numeric format using the concat function, and compressing the fields into one record, forming a Y × N data table;
Step 1.2: aggregating all data in the Y × N data table to the user dimension using GROUP BY and the collect_list function, compressing the N records into one entry and forming a Y × 1 data table;
Step 1.3: defining M such Y × 1 data tables and combining them to form the final Y × M raw data table.
The database query of the invention only extracts the raw fields and contains no computation. Compared with the prior art, this removes most of the steps that generate intermediate temporary tables and optimizes the allocation of database resources.
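The two compression passes and the final merge can be emulated in plain Python to make the data shapes concrete. This is only a sketch under stated assumptions: in the source the work is done in SQL with concat, GROUP BY, and collect_list, whereas the function names, the `|` delimiter, and the sample tables below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

FIELD_SEP = "|"  # hypothetical delimiter; must not occur inside any field


def pack_records(rows: List[Tuple[str, tuple]]) -> List[Tuple[str, str]]:
    """Step 1.1 analogue: join each record's fields into one string."""
    return [(user, FIELD_SEP.join(str(f) for f in fields)) for user, fields in rows]


def collect_by_user(packed: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Step 1.2 analogue: group by user and collect the packed records."""
    grouped = defaultdict(list)
    for user, record in packed:
        grouped[user].append(record)
    return dict(grouped)


def merge_tables(
    tables: Dict[str, Dict[str, List[str]]]
) -> Dict[str, Dict[str, List[str]]]:
    """Step 1.3 analogue: combine M per-user tables into one Y x M table."""
    users = set()
    for table in tables.values():
        users.update(table)
    return {
        user: {name: table.get(user, []) for name, table in tables.items()}
        for user in sorted(users)
    }


orders = [
    ("u1", ("2021-12-01", 120.5)),
    ("u1", ("2021-12-03", 80.0)),
    ("u2", ("2021-12-02", 15.0)),
]
logins = [("u1", ("2021-12-01", "app"))]

raw_table = merge_tables({
    "orders": collect_by_user(pack_records(orders)),
    "logins": collect_by_user(pack_records(logins)),
})
```

With Y = 2 users and M = 2 source tables, `raw_table` has one row per user and one column per source table. Joining fields with a delimiter loses no information only as long as the delimiter never appears inside a field value, which is an assumption of this sketch.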
Preferably, step 3 specifically comprises the following steps:
Step 3.1: preprocessing the data that was converted into the list and dict data formats of the Python environment in step 2;
Step 3.2: defining an initial default value for each feature, and then computing the value of each feature according to the mining logic.
Preferably, step 3.1 includes performing feature transformation on the raw data and adding derived data fields.
Preferably, the compression in step 1 consists of two rounds of lossless information compression of the multidimensional raw data.
A computer device comprising a processor, a memory, and a data mining program stored in the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements any of the above data mining methods based on the feature transformation function.
A computer-readable storage medium having a data mining program stored thereon, wherein the data mining program, when executed by a processor, implements the steps of any of the above data mining methods based on the feature transformation function.
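Steps 3.1 and 3.2 might look like the following in Python. This is a hedged sketch, not the patent's own code: the feature names in `DEFAULTS`, the input field names, and the `is_weekend` derived field are all hypothetical examples chosen for illustration.

```python
from datetime import date
from typing import Dict, List

# Hypothetical features with their initial default values (step 3.2).
DEFAULTS = {"order_count": 0, "weekend_order_count": 0, "total_amount": 0.0}


def preprocess(orders: List[Dict[str, str]]) -> List[dict]:
    """Step 3.1 analogue: cast string fields and add a derived field."""
    cleaned = []
    for rec in orders:
        day = date.fromisoformat(rec["order_date"])
        cleaned.append({
            "order_date": day,
            "amount": float(rec["amount"]),
            "is_weekend": day.weekday() >= 5,  # derived field (Sat or Sun)
        })
    return cleaned


def order_features(orders: List[Dict[str, str]]) -> Dict[str, float]:
    """Step 3.2 analogue: start from the defaults, then apply the mining logic."""
    features = dict(DEFAULTS)  # copy so the defaults are never mutated
    for rec in preprocess(orders):
        features["order_count"] += 1
        features["total_amount"] += rec["amount"]
        if rec["is_weekend"]:
            features["weekend_order_count"] += 1
    return features
```

Starting from explicit defaults means a user with no records still receives a complete feature vector, which is one reason to define the defaults before running any mining logic.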
The beneficial effects of the invention include:
1. The method uses Python, a more advanced tool, as its main computation tool, which greatly increases the potential for mining complex features; when generating the same features, the running speed is 5 to 6 times higher, which effectively improves operating efficiency.
2. The invention forms a reusable data module structure with higher fault tolerance, which effectively shortens the data mining development cycle. The new data module structure stores the raw data as it was before feature generation in a clear and traceable form, so the data can be modified quickly and the feature processing can be debugged or tested.
Drawings
Description of reference numerals:
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of embodiments of the present application, generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The invention is described in further detail below with reference to FIG. 1:
The data mining method based on the feature transformation function comprises the following steps:
Step 1: extracting, through SQL statements, the raw data required for feature mining from a database; performing two rounds of lossless information compression on the multidimensional raw data to form the data structure required by the feature transformation function; generating a data table structure; and storing the generated data table structure in a newly created storage space in the memory;
Step 1 specifically comprises the following steps:
Step 1.1: extracting all fields required for feature processing from the original underlying tables of the database; with Y users defined, each having N records, fixing each field in a string or numeric format using the concat function and compressing the fields into one record, forming a Y × N data table;
Step 1.2: aggregating all data in the Y × N data table to the user dimension using GROUP BY and the collect_list function, compressing the N records into one entry and forming a Y × 1 data table;
Step 1.3: defining M such Y × 1 data tables and combining them to form the final Y × M raw data table. Because each data module may draw on several underlying tables, the invention defines each data module as having M underlying tables, where M is any positive integer from 1 to n.
Step 2: retrieving the data table structure stored in the memory and converting it into the list and dict data formats of a Python environment, as the input for feature generation;
Step 3: constructing a feature transformation function and using it to convert the data that was converted into the list and dict data formats in step 2 into the final features.
Step 3 specifically comprises the following steps:
Step 3.1: preprocessing the data that was converted into the list and dict data formats of the Python environment in step 2, for example by converting formats and adding derived data fields.
Step 3.2: defining an initial default value for each feature, and then computing the value of each feature according to the mining logic.
The computation of each feature's value in step 3.2 includes, but is not limited to, computing various statistics, such as the mean, standard deviation, maximum, and minimum.
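The statistics named above can be computed with Python's standard library. The sketch below is illustrative only: the function name and feature keys are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the source does not say whether the sample or population form is intended.

```python
import statistics
from typing import Dict, List


def amount_statistics(amounts: List[float]) -> Dict[str, float]:
    """Compute mean, standard deviation, maximum, and minimum with defaults."""
    # Initial default values; returned unchanged when a user has no records.
    features = {"mean": 0.0, "std": 0.0, "max": 0.0, "min": 0.0}
    if amounts:
        features["mean"] = statistics.fmean(amounts)
        features["std"] = statistics.pstdev(amounts)  # population form (assumed)
        features["max"] = max(amounts)
        features["min"] = min(amounts)
    return features


stats = amount_statistics([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

Swapping `statistics.pstdev` for `statistics.stdev` would give the sample standard deviation instead, but that function raises an error for fewer than two values, so the guard would need to change accordingly.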
The method uses Python as its main computation tool, which greatly increases the potential for mining complex features in the future. The new data structure stores the raw data as it was before feature generation in a clear and traceable form, so features can be modified and traced back quickly and the feature processing can be debugged and tested. In addition, once the above steps are carried out, the features become modular and configurable; after optimization they form a reusable data module structure that is flexible, clear, and highly fault-tolerant, which effectively shortens the data mining development cycle.
The database query of the invention only extracts the raw fields and contains no computation. Compared with the prior art, this removes most of the steps that generate intermediate temporary tables and optimizes the allocation of database resources.
A computer device comprising a processor, a memory, and a data mining program stored in the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements any of the above data mining methods based on the feature transformation function.
A computer-readable storage medium having a data mining program stored thereon, wherein the data mining program, when executed by a processor, implements the steps of any of the above data mining methods based on the feature transformation function.
The above embodiments are merely specific embodiments of the present application. Although they are described in specific detail, they should not be construed as limiting the scope of the present application. It should be noted that those skilled in the art can make a number of changes and modifications without departing from the technical idea of the present application, all of which fall within its scope of protection.
Claims (7)
1. A data mining method based on a feature transformation function, characterized by comprising the following steps:
Step 1: extracting, through SQL statements, the raw data required for feature mining from a database; compressing the multidimensional raw data to form the data structure required by the feature transformation function; generating a data table structure; and storing the generated data table structure in a newly created storage space in the memory;
Step 2: retrieving the data table structure stored in the memory and converting it into the list and dict data formats of a Python environment, as the input for feature generation;
Step 3: constructing a feature transformation function and using it to convert the data that was converted into the list and dict data formats in step 2 into the final features.
2. The feature transformation function-based data mining method according to claim 1, wherein the step 1 specifically comprises the steps of:
Step 1.1: extracting all fields required for feature processing from the original underlying tables, fixing each field in a string or numeric format using the concat function, and compressing the fields into one record; with Y users defined, each having N records, this forms a Y × N data table;
Step 1.2: aggregating all data in the Y × N data table to the user dimension using GROUP BY and the collect_list function, compressing the N records into one entry to form a Y × 1 data table;
Step 1.3: defining M such Y × 1 data tables and combining them to form the final Y × M raw data table.
3. The feature transformation function-based data mining method according to claim 1, wherein the step 3 specifically comprises the steps of:
Step 3.1: preprocessing the data that was converted into the list and dict data formats of the Python environment in step 2;
Step 3.2: defining an initial default value for each feature, and then computing the value of each feature according to the mining logic.
4. The feature transformation function-based data mining method according to claim 3, wherein step 3.1 comprises performing feature transformation on the raw data and adding derived data fields.
5. The feature transformation function-based data mining method according to claim 1, wherein the compression in step 1 consists of two rounds of lossless information compression of the multidimensional raw data.
6. A computer device comprising a processor, a memory, and a data mining program stored on the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements the steps of the data mining method of any of claims 1 to 5.
7. A computer-readable storage medium, having a data mining program stored thereon, wherein the program, when executed by a processor, implements the steps of the data mining method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111640805.9A CN114490809A (en) | 2021-12-29 | 2021-12-29 | Data mining method and device based on feature transformation function and readable medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111640805.9A CN114490809A (en) | 2021-12-29 | 2021-12-29 | Data mining method and device based on feature transformation function and readable medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114490809A (en) | 2022-05-13 |
Family
ID=81497170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111640805.9A Pending CN114490809A (en) | 2021-12-29 | 2021-12-29 | Data mining method and device based on feature transformation function and readable medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114490809A (en) |
-
2021
- 2021-12-29 CN CN202111640805.9A patent/CN114490809A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106168965B (en) | Knowledge graph construction system | |
CN111506621B (en) | Data statistical method and device | |
CN109753502B (en) | Data acquisition method based on NiFi | |
CN108446313B (en) | Data format conversion method and device | |
CN102110123B (en) | Method for establishing inverted index | |
CN101630323B (en) | Method for compressing space of deterministic automaton | |
CN108664635B (en) | Method, device, equipment and storage medium for acquiring database statistical information | |
CN103279478A (en) | Method for extracting features based on distributed mutual information documents | |
EP3872703B1 (en) | Method and device for classifying face image, electronic device and storage medium | |
CN112699278A (en) | Graph database construction method and device, readable storage medium and electronic equipment | |
CN105589908A (en) | Association rule computing method for transaction set | |
CN111858730A (en) | Data importing and exporting device, method, equipment and medium of graph database | |
CN110598003A (en) | Knowledge graph construction system and construction method based on public data resource catalog | |
CN114443656A (en) | Customizable automated data model analysis tool and use method thereof | |
CN112905642B (en) | Method for storing IEC61850 report data into relational database based on CSV mapping file | |
CN114139040A (en) | Data storage and query method, device, equipment and readable storage medium | |
CN106874479A (en) | The improved method and device of the FP Growth algorithms based on FPGA | |
CN110874366A (en) | Data processing and query method and device | |
CN113779349A (en) | Data retrieval system, apparatus, electronic device, and readable storage medium | |
CN114490809A (en) | Data mining method and device based on feature transformation function and readable medium | |
CN106874498B (en) | Financial data access method and access device | |
CN115794756A (en) | Simulation file compression method and device, equipment and storage medium | |
CN112306421B (en) | Method and system for storing MDF file in analysis and measurement data format | |
CN108228759A (en) | Storage processing method, device, computer equipment and the storage medium of record set | |
CN113836211B (en) | Data extraction method for accessing data of internet of things equipment to JAVA platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||