CN115618194A - Spark-based data processing method - Google Patents
- Publication number
- CN115618194A
- Authority
- CN
- China
- Prior art keywords
- data
- index
- calculation
- calculated
- spark
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Pure & Applied Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Operations Research (AREA)
- Algebra (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Software Systems (AREA)
- Evolutionary Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a Spark-based data processing method that can efficiently process massive data with low server resource occupancy. In this technical scheme, the Spark big data in-memory engine is adopted, and the calculation relation between the statistical data and the index parameters to be calculated is decomposed in a preprocessing step into single-table operation relations and an overall operation relation. The index parameters to be calculated in each index table are first calculated separately, all the single-table statistical results are spliced into a TempFile stored in an HDFS file system, the subject table is associated with the TempFile, and the statistical result corresponding to each item of subject data is obtained through the overall operation relation.
Description
Technical Field
The invention relates to the technical field of data processing, in particular to a Spark-based data processing method.
Background
With the development of technology, many current applications must perform calculations over very large volumes of data. For example, when processing the log files of some large servers or the personnel information of a designated area, the related forms store tens of millions or billions of records. Some calculations associate a single table with multiple tables, for example when data on certain specified attributes is summarized for different subjects. In a conventional database-based data processing method, code is usually used to obtain the subject data, loop over it, query each specific table for the corresponding data, and obtain the statistical result of each item of subject data through final calculation and judgment. Because each table stores data on the order of tens of millions or billions of records and there may be dozens of tables, execution efficiency is very low, and a single-machine database can take months. If the calculation is instead based on table associations, associating dozens of large tables generates huge Cartesian products that exhaust system resources and may bring the database down. Even with a high-performance database, or with multi-table associations based on big data technology, a large amount of server resources must be occupied to meet efficiency requirements.
Disclosure of Invention
To solve the problem that conventional database-based methods execute too slowly when a single table is associated with multiple tables over massive data, the application provides a Spark-based data processing method that can efficiently process massive data with low server resource occupancy.
The technical scheme of the invention is as follows: a Spark-based data processing method, characterized by comprising the following steps:
S1: constructing a Spark running environment and an HDFS file system;
S2: confirming all form data participating in the calculation;
the form data includes: a subject table and index tables;
the subject table and the index tables are in a 1:N relationship, wherein N is a natural number greater than 1;
all subject data participating in the calculation are recorded in the subject table;
each index table records a different type of parameter index corresponding to the subject data;
S3: determining the data to be counted, recorded as: statistical data;
determining the parameters participating in the statistical calculation in each index table, recorded as: index parameters to be calculated;
S4: determining the calculation relation between the statistical data and each index parameter to be calculated;
the calculation relation comprises: single-table operation relations and an overall operation relation;
the single-table operation relation is as follows: each index parameter to be calculated in an index table participates in the calculation, and a single-table statistical result corresponding to that index table is output; each index table corresponds to one single-table operation relation;
the overall operation relation is as follows: all the single-table statistical results participate in the calculation, and the statistical result corresponding to each item of subject data is output;
S5: loading all form data participating in the calculation into the HDFS file system, and constructing a temporary table in the HDFS file system, recorded as: TempFile;
S6: configuring the single-table operation relations and the overall operation relation into Spark programs, recorded respectively as: single-table calculation programs and an overall calculation program;
S7: calculating all index parameters to be calculated in each index table based on the single-table calculation programs, and recording the obtained calculation results as: single-table statistical results;
S8: summarizing the single-table statistical results corresponding to each item of subject data into the TempFile, and establishing an association with the subject data;
S9: calculating, based on the overall operation relation, the single-table statistical results corresponding to each item of subject data in the TempFile, to obtain the statistical result corresponding to each item of subject data.
It is further characterized in that:
the relation between each index table and its index parameters to be calculated is 1:M, wherein M is a natural number greater than or equal to 1.
The Spark-based data processing method adopts the Spark big data in-memory engine and, in a preprocessing step, decomposes the calculation relation between the statistical data and the index parameters to be calculated into single-table operation relations and an overall operation relation. The index parameters to be calculated in each index table are calculated separately, all the single-table statistical results are spliced into a TempFile stored in the HDFS file system, the subject table is associated with the TempFile, and the statistical result corresponding to each item of subject data is obtained through the overall operation relation. Throughout the calculation, the association that brings the index tables together is performed only once, which reduces computational complexity and improves efficiency. In addition, the association is computed in batch using Spark's in-memory computation, with HDFS serving as Spark's file system, so that multi-table association can be completed without occupying large amounts of server resources, and operating efficiency is greatly improved.
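The benefit of aggregating each index table before associating it can be illustrated with a small sketch. This is plain Python rather than Spark, the records are invented for illustration, and the function names `join_raw` and `aggregate_then_join` are hypothetical; it only shows why joining raw records multiplies rows while joining per-subject aggregates does not.

```python
from collections import defaultdict

# Hypothetical raw records for one subject "A": (subject key, value).
donations = [("A", 200), ("A", 300), ("A", 100)]
volunteering = [("A", "item 1"), ("A", "item 2"), ("A", "item 3"), ("A", "item 4")]


def join_raw(left, right):
    """Associate raw rows on the key: one output row per matching pair."""
    by_key = defaultdict(list)
    for key, value in right:
        by_key[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in by_key.get(key, [])]


def aggregate_then_join(left, right):
    """Reduce each table to one row per key first, then associate once."""
    totals = defaultdict(int)
    for key, ml in left:
        totals[key] += ml
    counts = defaultdict(int)
    for key, _ in right:
        counts[key] += 1
    keys = sorted(set(totals) | set(counts))
    return [(k, totals.get(k, 0), counts.get(k, 0)) for k in keys]


raw_rows = join_raw(donations, volunteering)
agg_rows = aggregate_then_join(donations, volunteering)
print(len(raw_rows), len(agg_rows))
```

With three donation records and four volunteer records for the same subject, the raw association yields 3 × 4 rows, while the pre-aggregated association yields a single row; with dozens of index tables the raw product grows multiplicatively.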
Drawings
Fig. 1 is a schematic flow chart of a data processing method based on Spark in the present application.
Detailed Description
As shown in fig. 1, the present application provides a Spark-based data processing method, which includes the following steps.
S1: Construct a Spark running environment and an HDFS file system.
The HDFS file system is set as the file system of Spark. The technical scheme builds on the existing Spark parallel computing framework: Spark Core drives scheduling, the unified RDD data structure is used for data processing, and Spark is connected to the HDFS file system.
S2: Confirm all form data participating in the calculation;
the form data includes: a subject table and index tables;
the subject table and the index tables are in a 1:N relationship, wherein N is a natural number greater than 1;
all subject data participating in the calculation are recorded in the subject table;
each index table records a different type of parameter index corresponding to the subject data;
the relation between each index table and its index parameters to be calculated is 1:M, wherein M is a natural number greater than or equal to 1; that is, in a specific application each index table contains at least one index parameter to be calculated.
S3: Determine the data to be counted, recorded as: statistical data;
determine the parameters participating in the statistical calculation in each index table, recorded as: index parameters to be calculated.
For example, in this embodiment, the public welfare participation of natural persons in a specified area is counted.
The subject table is then set to: the natural person form TABLE_PERSON;
the index tables include: a blood donation form TABLE_XIANXUE and a volunteer form TABLE_ZHIYUANZHE;
the blood donation form records the blood donation records of all natural persons, and each record comprises the fields: time, name, ID card number, blood donation volume (ml);
the volunteer form records all natural persons' participation in volunteer activities, and each record comprises the fields: time, name, ID card number, support item.
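The three forms of this embodiment can be sketched as simple record types. The English field names, the sample values, and the two fields shown for the natural person form are assumptions made for illustration; the source lists fields only for the two index tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """One row of the subject table (fields assumed for illustration)."""
    id_card: str
    name: str


@dataclass(frozen=True)
class DonationRecord:
    """One row of the blood donation form TABLE_XIANXUE."""
    time: str
    name: str
    id_card: str
    donation_ml: int


@dataclass(frozen=True)
class VolunteerRecord:
    """One row of the volunteer form TABLE_ZHIYUANZHE."""
    time: str
    name: str
    id_card: str
    support_item: str


# Hypothetical sample data; the identifiers are placeholders, not real IDs.
persons = [Person("id-1", "Zhang San"), Person("id-2", "Li Si")]
donations = [
    DonationRecord("2022-01-05", "Zhang San", "id-1", 400),
    DonationRecord("2022-06-10", "Zhang San", "id-1", 200),
    DonationRecord("2022-03-01", "Li Si", "id-2", 1000),
]
volunteering = [
    VolunteerRecord("2022-02-01", "Zhang San", "id-1", "item a"),
    VolunteerRecord("2022-04-01", "Li Si", "id-2", "item b"),
]

# One subject table is associated with N index tables (here N = 2).
index_tables = {"TABLE_XIANXUE": donations, "TABLE_ZHIYUANZHE": volunteering}

# Every index record should refer to a subject that exists in the subject table.
known_ids = {p.id_card for p in persons}
orphans = [r for rows in index_tables.values() for r in rows if r.id_card not in known_ids]
print(len(index_tables), len(orphans))
```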
S4: Determine the calculation relation between the statistical data and each index parameter to be calculated;
the calculation relation includes: single-table operation relations and an overall operation relation;
the single-table operation relation is as follows: each index parameter to be calculated in an index table participates in the calculation, and a single-table statistical result corresponding to that index table is output; each index table corresponds to one single-table operation relation;
the overall operation relation is as follows: all the single-table statistical results participate in the calculation, and the statistical result corresponding to each item of subject data is output.
The specific overall operation relation and single-table operation relations are set according to actual requirements; they can be counts or interval-value calculations, or calculations using an existing calculation model or function. In this embodiment, the public welfare participation statistics do not simply list the two activities; instead, each natural person is assigned a single quantized value, an index called the social contribution degree, that represents that person's public welfare participation.
In this embodiment, the statistical data is: the social contribution degree;
the index parameters to be calculated are: the blood donation volume and the number of volunteer activities;
blood donation records are counted by donation volume, and volunteer activities are counted by number of participations; to measure the two activities on a common scale, the two index parameters are normalized by variable binning, for example: a total blood donation of more than 500 cc scores 10 points and otherwise 5 points; more than 3 volunteer activities scores 20 points and otherwise 10 points;
the overall operation relation is then: social contribution degree = blood donation score + volunteer activity score;
the blood donation score is calculated as follows: the sum of each natural person's blood donations is compared with 500 cc; if it is more than 500 cc, the score is 10 points, otherwise 5 points;
the volunteer activity score is calculated as follows: the total number of times each natural person participated in volunteer activities is compared with 3; if it is more than 3, the score is 20 points, otherwise 10 points.
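The binning rules and the overall relation of this embodiment can be written as three small functions. This is a minimal sketch: the function and constant names are hypothetical, and values exactly at a threshold are assigned to the lower bin, following the "more than ..., otherwise ..." wording of the score calculation.

```python
DONATION_THRESHOLD_ML = 500   # "more than 500 cc" earns the higher score
VOLUNTEER_THRESHOLD = 3       # "more than 3" activities earns the higher score


def donation_score(total_ml: int) -> int:
    """Bin the total blood donation volume into 10 or 5 points."""
    return 10 if total_ml > DONATION_THRESHOLD_ML else 5


def volunteer_score(total_times: int) -> int:
    """Bin the total number of volunteer activities into 20 or 10 points."""
    return 20 if total_times > VOLUNTEER_THRESHOLD else 10


def social_contribution(total_ml: int, total_times: int) -> int:
    """Overall relation: blood donation score plus volunteer activity score."""
    return donation_score(total_ml) + volunteer_score(total_times)


print(social_contribution(600, 5))   # 10 + 20
print(social_contribution(1000, 2))  # 10 + 10
```

The two calls use the totals that appear in the embodiment's TempFile example (600 ml with 5 activities, and 1000 ml with 2 activities).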
For each index table, a single-table operation needs to be performed:
the single-table operation relation for the blood donation form TABLE_XIANXUE is: sum the blood donation volume for each natural person;
the single-table operation relation for the volunteer form TABLE_ZHIYUANZHE is: count the number of volunteer activities for each natural person.
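The two single-table operations are per-key reductions: a sum for the blood donation form and a count for the volunteer form. The sketch below uses plain Python dictionaries with invented records and hypothetical function names; in a Spark program the same shape would typically be expressed as a keyed aggregation, which is an assumption about the implementation rather than something the source spells out.

```python
from collections import Counter, defaultdict

# Hypothetical rows already reduced to (id_card, value) pairs.
donation_rows = [("id-1", 400), ("id-1", 200), ("id-2", 1000)]
volunteer_rows = [
    ("id-1", "item a"), ("id-1", "item b"), ("id-1", "item c"),
    ("id-1", "item d"), ("id-1", "item e"),
    ("id-2", "item a"), ("id-2", "item f"),
]


def sum_donations(rows):
    """Single-table operation for TABLE_XIANXUE: total ml per natural person."""
    totals = defaultdict(int)
    for id_card, ml in rows:
        totals[id_card] += ml
    return dict(totals)


def count_activities(rows):
    """Single-table operation for TABLE_ZHIYUANZHE: activity count per person."""
    return dict(Counter(id_card for id_card, _ in rows))


donation_totals = sum_donations(donation_rows)
activity_counts = count_activities(volunteer_rows)
print(donation_totals)
print(activity_counts)
```

Each function touches only its own index table, so the two operations are independent of each other and can run separately.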
S5: Load all form data participating in the calculation into the HDFS file system, and construct a temporary table in the HDFS file system, recorded as: TempFile.
Each form contains many data fields. When loading into the HDFS file system, the whole form does not need to be loaded; only the fields participating in the calculation and the key field of the subject data are loaded, which reduces the occupation of system resources.
The data fields to be loaded from the blood donation form TABLE_XIANXUE are: ID card number, blood donation volume (ml);
the data fields to be loaded from the volunteer form TABLE_ZHIYUANZHE are: ID card number, support item.
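Loading only the needed fields amounts to a column projection applied before the data is written to the file system. The following sketch shows that projection on in-memory rows; the mapping `LOAD_FIELDS`, the function `project`, and the English field names are hypothetical, and the actual write to HDFS is omitted.

```python
# Fields to keep per index table: the subject key plus the calculated field.
LOAD_FIELDS = {
    "TABLE_XIANXUE": ("id_card", "donation_ml"),
    "TABLE_ZHIYUANZHE": ("id_card", "support_item"),
}


def project(table_name, rows):
    """Keep only the fields that take part in the calculation."""
    fields = LOAD_FIELDS[table_name]
    return [{field: row[field] for field in fields} for row in rows]


# Hypothetical full-width source row from the blood donation form.
full_rows = [
    {"time": "2022-01-05", "name": "Zhang San", "id_card": "id-1", "donation_ml": 400},
]
loaded = project("TABLE_XIANXUE", full_rows)
print(loaded)
```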
S6: Configure the single-table operation relations and the overall operation relation into Spark programs, recorded respectively as: single-table calculation programs and an overall calculation program.
S7: Based on the single-table calculation programs, calculate all index parameters to be calculated in each index table, and record the obtained calculation results as: single-table statistical results.
The overall operation relation and the single-table operation relation corresponding to each form are configured into Spark programs, and the fields loaded into HDFS are calculated separately. For example, the blood donation volume is summed by ID card number and the support items are counted by ID card number, finally giving the total blood donation volume and the total number of volunteer activities for each ID card number.
S8: Summarize the single-table statistical results corresponding to each item of subject data into the TempFile, and establish an association with the subject data.
The total blood donation volume and the total number of volunteer activities for each ID card number are summarized into the TempFile and associated with the subject data;
the following is recorded in the TempFile:
ID card number, name, blood donation volume, volunteer activities;
320***123, Zhang San, 600, 5;
320***342, Li Si, 1000, 2.
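Building the TempFile can be sketched as one pass over the subject table that looks up each single-table result by ID card number. The example writes comma-separated text to an in-memory buffer instead of HDFS; the function name, column names, placeholder identifiers, and the choice to default a missing result to 0 are all assumptions for illustration.

```python
import csv
import io

# Hypothetical subject rows and single-table results keyed by ID card number.
persons = [("id-1", "Zhang San"), ("id-2", "Li Si")]
donation_totals = {"id-1": 600, "id-2": 1000}
activity_counts = {"id-1": 5, "id-2": 2}


def build_tempfile(persons, donation_totals, activity_counts):
    """Splice the single-table results into one row per subject."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["id_card", "name", "donation_ml", "volunteer_times"])
    for id_card, name in persons:
        writer.writerow([
            id_card,
            name,
            donation_totals.get(id_card, 0),   # assumed default when no records exist
            activity_counts.get(id_card, 0),
        ])
    return buffer.getvalue()


tempfile_text = build_tempfile(persons, donation_totals, activity_counts)
print(tempfile_text)
```

Because every index table has already been reduced to one value per subject, this association produces exactly one TempFile row per subject.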
S9: Based on the overall operation relation, calculate the single-table statistical results corresponding to each item of subject data in the TempFile to obtain the statistical result corresponding to each item of subject data:
ID card number, name, social contribution degree;
320***123, Zhang San, 30;
320***342, Li Si, 20.
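The final step is a row-by-row map over the TempFile. The sketch below parses a small comma-separated TempFile held in a string and applies the embodiment's overall relation to each row; the column names, placeholder identifiers, and function names are hypothetical.

```python
import csv
import io

# Hypothetical TempFile content with the totals from the embodiment.
TEMPFILE_TEXT = """id_card,name,donation_ml,volunteer_times
id-1,Zhang San,600,5
id-2,Li Si,1000,2
"""


def overall(row):
    """Overall relation: blood donation score plus volunteer activity score."""
    donation = 10 if int(row["donation_ml"]) > 500 else 5
    volunteer = 20 if int(row["volunteer_times"]) > 3 else 10
    return donation + volunteer


def compute_results(text):
    """Apply the overall relation to every TempFile row."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["id_card"], row["name"], overall(row)) for row in reader]


results = compute_results(TEMPFILE_TEXT)
for id_card, name, score in results:
    print(id_card, name, score)
```

No further lookup into the index tables is needed at this stage, since each TempFile row already carries every value the overall relation requires.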
With this technical scheme, the statistical data determines the index parameters to be calculated in each index table; these index parameters are loaded into the HDFS file system; the overall calculation relation between the statistical data and the index parameters to be calculated is decomposed to obtain the single-table operation relation for each index table; and the single-table operation relations are configured in the Spark engine. After each index table has been calculated, the single-table statistical results of all index tables are associated with the subject data through the temporary table TempFile, and the final statistical result is calculated from the data in the TempFile. The final statistical result is obtained with a single association; combined with the in-memory computation of the Spark engine, calculation over data on the order of hundreds of millions of records can be completed within hours, and all calculation can be completed on a single-server configuration without occupying excessive server resources, which greatly improves efficiency and saves cost.
Claims (2)
1. A Spark-based data processing method, characterized by comprising the following steps:
S1: constructing a Spark running environment and an HDFS file system;
S2: confirming all form data participating in the calculation;
the form data includes: a subject table and index tables;
the subject table and the index tables are in a 1:N relationship, wherein N is a natural number greater than 1;
all subject data participating in the calculation are recorded in the subject table;
each index table records a different type of parameter index corresponding to the subject data;
S3: determining the data to be counted, recorded as: statistical data;
determining the parameters participating in the statistical calculation in each index table, recorded as: index parameters to be calculated;
S4: determining the calculation relation between the statistical data and each index parameter to be calculated;
the calculation relation comprises: single-table operation relations and an overall operation relation;
the single-table operation relation is as follows: each index parameter to be calculated in an index table participates in the calculation, and a single-table statistical result corresponding to that index table is output; each index table corresponds to one single-table operation relation;
the overall operation relation is as follows: all the single-table statistical results participate in the calculation, and the statistical result corresponding to each item of subject data is output;
S5: loading all form data participating in the calculation into the HDFS file system, and constructing a temporary table in the HDFS file system, recorded as: TempFile;
S6: configuring the single-table operation relations and the overall operation relation into Spark programs, recorded respectively as: single-table calculation programs and an overall calculation program;
S7: calculating all index parameters to be calculated in each index table based on the single-table calculation programs, and recording the obtained calculation results as: single-table statistical results;
S8: summarizing the single-table statistical results corresponding to each item of subject data into the TempFile, and establishing an association with the subject data;
S9: calculating, based on the overall operation relation, the single-table statistical results corresponding to each item of subject data in the TempFile, to obtain the statistical result corresponding to each item of subject data.
2. The Spark-based data processing method as claimed in claim 1, wherein: the relation between each index table and its index parameters to be calculated is 1:M, wherein M is a natural number greater than or equal to 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211629246.6A CN115618194A (en) | 2022-12-19 | 2022-12-19 | Spark-based data processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211629246.6A CN115618194A (en) | 2022-12-19 | 2022-12-19 | Spark-based data processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115618194A true CN115618194A (en) | 2023-01-17 |
Family
ID=84879819
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211629246.6A Pending CN115618194A (en) | 2022-12-19 | 2022-12-19 | Spark-based data processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115618194A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108959356A (en) * | 2018-05-07 | 2018-12-07 | 国网上海市电力公司 | A kind of intelligence adapted TV university Data application system Data Mart method for building up |
CN109473178A (en) * | 2018-11-12 | 2019-03-15 | 北京懿医云科技有限公司 | Method, system, equipment and the storage medium of medical data integration |
CN110209646A (en) * | 2019-05-14 | 2019-09-06 | 汇通达网络股份有限公司 | A kind of data platform system calculated based on real-time streaming |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8463822B2 (en) | Data merging in distributed computing | |
CN107247811B (en) | SQL statement performance optimization method and device based on Oracle database | |
CN113051291A (en) | Work order information processing method, device, equipment and storage medium | |
Ausloos et al. | Statistical dynamics of religions and adherents | |
CN112579586A (en) | Data processing method, device, equipment and storage medium | |
CN113553341A (en) | Multidimensional data analysis method, multidimensional data analysis device, multidimensional data analysis equipment and computer readable storage medium | |
CN114860780B (en) | Data warehouse, data processing system and computer device | |
CN109615172A (en) | A kind of method and terminal handling examination data | |
CN114741368A (en) | Log data statistical method based on artificial intelligence and related equipment | |
CN115599840A (en) | Complex service data management method and system | |
CN115938600A (en) | Mental health state prediction method and system based on correlation analysis | |
CN113641739B (en) | Spark-based intelligent data conversion method | |
CN112818000B (en) | Label library management and application method, system and computer equipment based on multi-label main body | |
Brusco et al. | Deterministic blockmodelling of signed and two‐mode networks: A tutorial with software and psychological examples | |
CN110502529B (en) | Data processing method, device, server and storage medium | |
CN115618194A (en) | Spark-based data processing method | |
CN114722789A (en) | Data report integration method and device, electronic equipment and storage medium | |
CN116089490A (en) | Data analysis method, device, terminal and storage medium | |
CN115147082A (en) | Special medicine-based insurance intelligent claim settlement system | |
CN112494933B (en) | Game data warehouse construction method and device | |
CN114860819A (en) | Method, device, equipment and storage medium for constructing business intelligent system | |
CN114155037A (en) | Work result visualization method and system | |
US20150081735A1 (en) | System and method for fast identification of variable roles during initial data exploration | |
CN113077227B (en) | Method and device for processing chat quantity of enterprise information portal group and electronic equipment | |
CN113704327B (en) | Data recording method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20230117 |