CN114490809A - Data mining method and device based on feature transformation function and readable medium - Google Patents

Data mining method and device based on feature transformation function and readable medium

Publication number
CN114490809A
Authority
CN
China
Prior art keywords
data
feature
mining
data mining
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111640805.9A
Other languages
Chinese (zh)
Inventor
张菲兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan XW Bank Co Ltd
Original Assignee
Sichuan XW Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan XW Bank Co Ltd
Priority to CN202111640805.9A
Publication of CN114490809A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/244 Grouping and aggregation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24558 Binary matching operations
    • G06F16/2456 Join operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a data mining method, device, and readable medium based on a feature transformation function, belonging to the technical field of computers. The method comprises the following steps. (1) Raw data extraction: extract the raw data required for feature mining and apply two rounds of lossless information compression to the complex multidimensional raw data to form the data structure required by the feature transformation function. (2) Format conversion: construct a format conversion function that converts the Y × M data table generated in step (1) into list and dict formats that can be processed in a Python environment, for use as the input to feature generation. (3) Feature generation: construct a feature transformation function that converts the input raw data in that specific format into the final features. The invention thereby establishes a data mining method structured around a feature transformation function and uses that function to build a more advanced and efficient feature mining method.

Description

Data mining method and device based on feature transformation function and readable medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data mining method, device, and readable medium based on a feature transformation function.
Background
With the rapid development of machine learning, the technology has been widely applied to data analysis in many industries and has brought great convenience. In the prior art, feature mining queries the raw data directly with SQL statements and computes the features using SQL's own computational capabilities, so that querying and computation end up coupled together. In addition, because of SQL's computational and resource limitations, feature mining generates a large number of useless temporary intermediate tables, which occupy resources to no purpose.
Disclosure of Invention
To solve the above technical problems in the prior art, the invention provides a data mining method, device, and readable medium based on a feature transformation function, and uses the feature transformation function to build a more advanced and efficient feature mining method.
The technical scheme adopted by the invention is as follows:
The data mining method based on the feature transformation function comprises the following steps:
Step 1: extract the raw data required for feature mining from a database using SQL statements, compress the multidimensional raw data into the data structure required by the feature transformation function, generate a data table structure, and store the generated data table structure in a newly created storage space in memory;
Step 2: retrieve the data table structure stored in memory and convert it into the list and dict data formats of a Python environment, for use as the input to feature generation;
Step 3: construct a feature transformation function and use it to convert the data that was converted into list and dict formats in step 2 into the final features.
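The format conversion of step 2 can be pictured with a minimal sketch. It assumes, hypothetically, that each compressed record arrives as a delimiter-joined string and that the field order is known in advance; the `to_records` helper, the `|` delimiter, and the `casts` parameter are illustrative names that do not appear in the source.

```python
FIELD_SEP = "|"  # hypothetical delimiter; the source does not specify one


def to_records(packed, fields, casts=None):
    """Convert a list of packed strings into a list of dicts (one per record).

    `casts` optionally maps a field name to a callable that restores its type;
    fields without an entry stay as strings.
    """
    casts = casts or {}
    records = []
    for item in packed:
        values = item.split(FIELD_SEP)
        if len(values) != len(fields):
            raise ValueError(
                f"expected {len(fields)} fields, got {len(values)}: {item!r}"
            )
        records.append({f: casts.get(f, str)(v) for f, v in zip(fields, values)})
    return records


# One user's packed loan records, converted for feature generation.
records = to_records(
    ["100|paid", "250|overdue"],
    fields=["amount", "status"],
    casts={"amount": float},
)
```

Rejecting a record whose field count is wrong, rather than padding it, keeps a malformed row from silently shifting values into the wrong fields.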
The method uses Python as its main computational tool, which greatly increases the potential for mining complex features in the future. The new data structure makes it possible to store and trace back the raw data as it was before feature generation, to modify and trace features quickly, and to debug and test the feature processing. Moreover, after the above steps the features are modular and configurable, and after optimization they form a reusable data module structure that is flexible, clear, and fault tolerant, which effectively shortens the development cycle of data mining.
Preferably, step 1 specifically comprises the following steps:
Step 1.1: extract all fields required for feature processing from the original underlying tables of the database, fix each field in string or numeric format using the concat function, and compress the fields into one record, forming a Y × N data table (Y users, each with N records);
Step 1.2: aggregate all data in the Y × N data table to the user dimension using GROUP BY and the collect_list function, compressing the N records of each user into one entry and forming a Y × 1 data table;
Step 1.3: define M such Y × 1 data tables and merge them to form the final Y × M raw data table.
The database query of the invention only extracts raw fields and performs no computation, which eliminates most of the steps that generate intermediate temporary tables and, compared with the prior art, makes better use of database resources.
Preferably, step 3 specifically comprises the following steps:
Step 3.1: preprocess the data that was converted into Python list and dict formats in step 2;
Step 3.2: define an initial default value for each feature, then calculate the value of each feature according to the mining logic.
Preferably, step 3.1 includes transforming the raw data and adding derived data fields.
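As a rough illustration of the preprocessing in step 3.1, the sketch below parses a date string and adds one derived field. The `preprocess` function, the `date` and `days_ago` field names, and the reference date are hypothetical; the source does not say which derived fields are used.

```python
from datetime import date


def preprocess(records, as_of):
    """Return new records with parsed dates and a derived 'days_ago' field."""
    out = []
    for rec in records:
        rec = dict(rec)  # copy, so the raw input stays available for tracing
        rec["date"] = date.fromisoformat(rec["date"])
        rec["days_ago"] = (as_of - rec["date"]).days
        out.append(rec)
    return out


raw_records = [
    {"date": "2021-12-01", "amount": 100.0},
    {"date": "2021-12-20", "amount": 250.0},
]
clean = preprocess(raw_records, as_of=date(2021, 12, 29))
```

Copying each record before modifying it leaves the raw data untouched, in line with the stated goal of being able to trace back to the data as it was before feature generation.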
Preferably, the compression in step 1 consists of two rounds of lossless information compression of the multidimensional raw data.
A computer device comprises a processor, a memory, and a data mining program stored in the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements any of the above data mining methods based on a feature transformation function.
A computer-readable storage medium stores a data mining program, wherein the data mining program, when executed by a processor, implements the steps of any of the above data mining methods based on a feature transformation function.
The beneficial effects of the invention include:
1. The method uses the more advanced Python as its main computational tool, which greatly increases the potential for mining complex features; when generating the same features, the running speed is 5 to 6 times higher, which effectively improves efficiency.
2. The invention forms a reusable data module structure with higher fault tolerance, which effectively shortens the development cycle of data mining. The new data module structure makes it possible to store and trace back the raw data as it was before feature generation, to modify it quickly, and to debug or test the feature processing.
Drawings
Brief description of the drawings:
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of embodiments of the present application, generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
The invention is described in further detail below with reference to FIG. 1.
The data mining method based on the feature transformation function comprises the following steps:
Step 1: extract the raw data required for feature mining from a database using SQL statements, apply two rounds of lossless information compression to the multidimensional raw data to form the data structure required by the feature transformation function, generate a data table structure, and store the generated data table structure in a newly created storage space in memory.
Step 1 specifically comprises the following steps:
Step 1.1: extract all fields required for feature processing from the original underlying tables of the database; given Y users, each with N records, fix each field in string or numeric format using the concat function and compress the fields into one record, forming a Y × N data table;
Step 1.2: aggregate all data in the Y × N data table to the user dimension using GROUP BY and the collect_list function, compressing the N records of each user into one entry and forming a Y × 1 data table;
Step 1.3: define M such Y × 1 data tables and merge them to form the final Y × M raw data table. Because each data module may draw on several underlying tables, the invention defines each data module as having M underlying tables, where M is any positive integer from 1 to n.
Step 2: retrieve the data table structure stored in memory and convert it into the list and dict data formats of a Python environment, for use as the input to feature generation;
Step 3: construct a feature transformation function and use it to convert the data that was converted into list and dict formats in step 2 into the final features.
Step 3 specifically comprises the following steps:
Step 3.1: preprocess the data that was converted into Python list and dict formats in step 2, for example by converting formats and adding derived data fields;
Step 3.2: define an initial default value for each feature, then calculate the value of each feature according to the mining logic.
The calculation of each feature value in step 3.2 includes, but is not limited to, computing various statistics such as the mean, standard deviation, maximum, and minimum.
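Step 3.2 can be sketched as a function that starts from default values and overwrites them only where the data supports a calculation. The feature names, the `amount` field, and the `-1.0` sentinel default are assumptions chosen for illustration; the source does not state which defaults it uses.

```python
import statistics

# Hypothetical defaults: -1.0 marks "not computable" for this feature group.
DEFAULTS = {
    "amount_mean": -1.0,
    "amount_std": -1.0,
    "amount_max": -1.0,
    "amount_min": -1.0,
}


def amount_features(records):
    """Start from the defaults, then fill in each statistic that can be computed."""
    features = dict(DEFAULTS)
    amounts = [r["amount"] for r in records]
    if amounts:
        features["amount_mean"] = statistics.fmean(amounts)
        features["amount_max"] = max(amounts)
        features["amount_min"] = min(amounts)
    if len(amounts) >= 2:
        # The sample standard deviation needs at least two data points.
        features["amount_std"] = statistics.stdev(amounts)
    return features


features = amount_features(
    [{"amount": 100.0}, {"amount": 250.0}, {"amount": 400.0}]
)
```

Because every feature has a default, a user with no records still yields a complete feature row, so downstream code never has to handle a missing key.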
The method uses Python as its main computational tool, which greatly increases the potential for mining complex features in the future. The new data structure makes it possible to store and trace back the raw data as it was before feature generation, to modify and trace features quickly, and to debug and test the feature processing. Moreover, after the above steps the features are modular and configurable, and after optimization they form a reusable data module structure that is flexible, clear, and fault tolerant, which effectively shortens the development cycle of data mining.
The database query of the invention only extracts raw fields and performs no computation, which eliminates most of the steps that generate intermediate temporary tables and, compared with the prior art, makes better use of database resources.
A computer device comprises a processor, a memory, and a data mining program stored in the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements any of the above data mining methods based on a feature transformation function.
A computer-readable storage medium stores a data mining program, wherein the data mining program, when executed by a processor, implements the steps of any of the above data mining methods based on a feature transformation function.
The above embodiments are only specific embodiments of the present application. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of protection of the present application. It should be noted that those skilled in the art can make several changes and modifications without departing from the technical idea of the present application, and all of these fall within the scope of protection of the present application.

Claims (7)

1. A data mining method based on a feature transformation function, characterized by comprising the following steps:
Step 1: extract the raw data required for feature mining from a database using SQL statements, compress the multidimensional raw data into the data structure required by the feature transformation function, generate a data table structure, and store the generated data table structure in a newly created storage space in memory;
Step 2: retrieve the data table structure stored in memory and convert it into the list and dict data formats of a Python environment, for use as the input to feature generation;
Step 3: construct a feature transformation function and use it to convert the data that was converted into list and dict formats in step 2 into the final features.
2. The data mining method based on a feature transformation function according to claim 1, wherein step 1 specifically comprises the following steps:
Step 1.1: extract all fields required for feature processing from the original underlying tables, fix each field in string or numeric format using the concat function, and compress the fields into one record; given Y users, each with N records, this forms a Y × N data table;
Step 1.2: aggregate all data in the Y × N data table to the user dimension using GROUP BY and the collect_list function, compressing the N records of each user into one entry and forming a Y × 1 data table;
Step 1.3: define M such Y × 1 data tables and merge them to form the final Y × M raw data table.
3. The data mining method based on a feature transformation function according to claim 1, wherein step 3 specifically comprises the following steps:
Step 3.1: preprocess the data that was converted into Python list and dict formats in step 2;
Step 3.2: define an initial default value for each feature, then calculate the value of each feature according to the mining logic.
4. The data mining method based on a feature transformation function according to claim 3, wherein step 3.1 comprises transforming the raw data and adding derived data fields.
5. The data mining method based on a feature transformation function according to claim 1, wherein the compression in step 1 consists of two rounds of lossless information compression of the multidimensional raw data.
6. A computer device comprising a processor, a memory, and a data mining program stored in the memory and executable by the processor, wherein the data mining program, when executed by the processor, implements the steps of the data mining method of any one of claims 1 to 5.
7. A computer-readable storage medium having a data mining program stored thereon, wherein the program, when executed by a processor, implements the steps of the data mining method of any one of claims 1 to 5.
CN202111640805.9A 2021-12-29 2021-12-29 Data mining method and device based on feature transformation function and readable medium Pending CN114490809A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111640805.9A CN114490809A (en) 2021-12-29 2021-12-29 Data mining method and device based on feature transformation function and readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111640805.9A CN114490809A (en) 2021-12-29 2021-12-29 Data mining method and device based on feature transformation function and readable medium

Publications (1)

Publication Number Publication Date
CN114490809A (en) 2022-05-13

Family

ID=81497170

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111640805.9A Pending CN114490809A (en) 2021-12-29 2021-12-29 Data mining method and device based on feature transformation function and readable medium

Country Status (1)

Country Link
CN (1) CN114490809A (en)

Similar Documents

Publication Publication Date Title
CN106168965B (en) Knowledge graph construction system
CN111506621B (en) Data statistical method and device
CN109753502B (en) Data acquisition method based on NiFi
CN108446313B (en) Data format conversion method and device
CN102110123B (en) Method for establishing inverted index
CN101630323B (en) Method for compressing space of deterministic automaton
CN108664635B (en) Method, device, equipment and storage medium for acquiring database statistical information
CN103279478A (en) Method for extracting features based on distributed mutual information documents
EP3872703B1 (en) Method and device for classifying face image, electronic device and storage medium
CN112699278A (en) Graph database construction method and device, readable storage medium and electronic equipment
CN105589908A (en) Association rule computing method for transaction set
CN111858730A (en) Data importing and exporting device, method, equipment and medium of graph database
CN110598003A (en) Knowledge graph construction system and construction method based on public data resource catalog
CN114443656A (en) Customizable automated data model analysis tool and use method thereof
CN112905642B (en) Method for storing IEC61850 report data into relational database based on CSV mapping file
CN114139040A (en) Data storage and query method, device, equipment and readable storage medium
CN106874479A (en) The improved method and device of the FP Growth algorithms based on FPGA
CN110874366A (en) Data processing and query method and device
CN113779349A (en) Data retrieval system, apparatus, electronic device, and readable storage medium
CN114490809A (en) Data mining method and device based on feature transformation function and readable medium
CN106874498B (en) Financial data access method and access device
CN115794756A (en) Simulation file compression method and device, equipment and storage medium
CN112306421B (en) Method and system for storing MDF file in analysis and measurement data format
CN108228759A (en) Storage processing method, device, computer equipment and the storage medium of record set
CN113836211B (en) Data extraction method for accessing data of internet of things equipment to JAVA platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination