CN113806451A - Data division processing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113806451A
Authority
CN
China
Prior art keywords
file
data
service
preset
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111092757.4A
Other languages
Chinese (zh)
Inventor
邢雨濛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Puhui Enterprise Management Co Ltd
Original Assignee
Ping An Puhui Enterprise Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Puhui Enterprise Management Co Ltd filed Critical Ping An Puhui Enterprise Management Co Ltd
Priority to CN202111092757.4A priority Critical patent/CN113806451A/en
Publication of CN113806451A publication Critical patent/CN113806451A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to artificial intelligence technology and provides a data division processing method. Data to be processed is first stripped from Hive according to a preset dimension and converted into a specific format file; the specific format file is then placed in a preset file server through a temporary directory of Hive; a preset front-end service system then obtains the specific format file from the file server and saves it to a preset service server to form service file data. In this embodiment, handling small batches of data through the front-end service system and the file server relieves the load on Hive, so that Hive can concentrate on processing genuinely large data sets, which improves data processing efficiency and saves labor and time.

Description

Data division processing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a data partitioning method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In an insurance business system, the current big data architecture synchronizes all existing business data to a Hive database by means of data synchronization (Hive is an SQL-like data warehouse tool for storing and analyzing big data). Developers need to separate the data with the service as the dimension, so that the data of each service falls within the scope of that service's tables, and to integrate all data created by the insurance business system.
At present, all report or data display platforms need to be developed in Hive to analyze data and then synchronize the data to a business system. Generally, data calculation, analysis and similar operations are performed in the Hive database, and the results are synchronized to the business system by querying. The scheme provided by Hive is to query the required data with SQL-like statements and to write a script that generally takes the month and 1,000,000 records (100W) as dimensions; for example, if a month contains 5,000,000 records (500W), 5 files need to be generated with 1,000,000 records as the dimension. However, when files are divided in units of 1,000,000 records, a full MapReduce job (a statistical data analysis function) has to be run for each batch of 1,000,000 records, which is slow, time-consuming, and seriously affects efficiency.
Therefore, a data division method that can increase the data processing rate, reduce processing time and save labor is needed.
Disclosure of Invention
The invention provides a data division processing method that aims to solve the following problem: at present, all report or data display platforms need to be developed in Hive to analyze data and then synchronize the data to a service system; data calculation, analysis and similar operations are generally performed in the Hive database, and the results are synchronized to the service system by querying; the scheme provided by Hive is to query the required data with SQL-like statements and to write a script that generally takes the month and 1,000,000 records (100W) as dimensions; however, when files are divided in units of 1,000,000 records, a full MapReduce job has to be run for each batch of 1,000,000 records, which is slow, time-consuming, and seriously affects efficiency.
In order to achieve the above object, the data division processing method provided by the present invention includes:
stripping data to be processed from Hive according to a preset dimension, and performing format conversion on the data to be processed to form a specific format file;
placing the specific format file in a preset file server through a temporary directory of Hive;
acquiring the specific format file from the file server through a preset front-end service system, and saving the specific format file to a preset service server to form service file data, wherein the preset front-end service system is connected to the file server in advance;
and splitting the service file data in the service server to form file groups, and saving the file groups separately to complete the data division processing.
Optionally, stripping the data to be processed from Hive according to a preset dimension and performing format conversion on the data to be processed to form a specific format file includes:
retrieving the data to be processed in Hive according to the preset dimension, and marking the data to be processed;
according to the mark, stripping and exporting the data to be processed from Hive through a preset export statement;
setting a specific format for the data to be processed that has been stripped and exported from Hive;
and converting the data to be processed based on the specific format to form a specific format file.
Optionally, placing the specific format file in a preset file server through the temporary directory of Hive includes:
dividing the specific format file into a preset number of subfiles;
collecting the subfiles into a subfile set;
placing the subfile set under the temporary directory of Hive to form a Hive temporary file;
and sending the Hive temporary file, through a sending instruction, to a file server connected in advance to the temporary directory.
Optionally, collecting the subfiles into a subfile set includes:
acquiring the specific format file to which the subfiles belong;
assigning a primary signature to the specific format file, and acquiring the order in which the subfiles were divided from the specific format file;
assigning sequence numbers to the subfiles according to that order, and adding the primary signature of the specific format file to each sequence number to form a label name for each subfile;
and arranging the subfiles by label name to form a subfile list, and moving the subfile list to a blank folder to form the subfile set.
Optionally, acquiring the specific format file from the file server through a preset front-end service system and saving the specific format file to a preset service server to form service file data includes:
connecting the preset front-end service system to the file server, wherein the front-end service system is an independent system used for reading files in the file server;
acquiring the subfile set in the file server through the front-end service system;
reassembling the subfile set to restore it into the specific format file;
and transmitting the specific format file to a service server connected to the file server.
Optionally, the method further includes storing the specific format file to form a service file, wherein storing the specific format file to form a service file includes:
connecting an external service database to the service server;
and storing the specific format files into the external service database in batches through a dump plug-in to form service files.
Optionally, splitting the service file data in the service server to form file groups and saving the file groups separately to complete the data division processing includes:
traversing and reading the service file to form service data;
adding a time identifier to each piece of service data according to a time field preset in the service database;
classifying and summarizing the service data according to the time identifier through a preset data retrieval program, and mapping the service data of each category to the data retrieval command corresponding to that category;
generating a processing file from the service data corresponding to the data retrieval command;
splitting the processing file according to a preset quantity to form file groups;
and saving the file groups separately to complete the data division processing.
In order to solve the above problem, the present invention further provides an efficient data division processing apparatus, including:
a format specifying unit, used for stripping data to be processed from Hive according to a preset dimension and performing format conversion on the data to be processed to form a specific format file;
a file externalizing unit, used for placing the specific format file in a preset file server through a temporary directory of Hive;
a service file unit, used for acquiring the specific format file from the file server through a preset front-end service system and saving the specific format file to a preset service server to form service file data, wherein the preset front-end service system is connected to the file server in advance;
and a data dividing unit, used for splitting the service file data in the service server to form file groups and saving the file groups separately to complete the data division processing.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the steps of the data division processing method.
In order to solve the above problem, the present invention further provides a computer-readable storage medium in which at least one instruction is stored, the at least one instruction being executed by a processor in an electronic device to implement the data division processing method described above.
In the method, data to be processed is first stripped from Hive according to a preset dimension and converted into a specific format file; the specific format file is then placed in a preset file server through a temporary directory of Hive; a preset front-end service system then obtains the specific format file from the file server and saves it to a preset service server to form service file data. In this embodiment, handling small batches of data through the front-end service system and the file server relieves the load on Hive, so that Hive can concentrate on processing genuinely large data sets, which improves data processing efficiency and saves labor and time.
Drawings
Fig. 1 is a schematic flow chart of a data division processing method according to an embodiment of the present invention;
Fig. 2 is a block diagram of a data division processing apparatus according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the internal structure of an electronic device implementing the data division processing method according to an embodiment of the present invention.
The implementation, functional features and advantages of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In an insurance business system, the current big data architecture takes the existing data in Oracle and synchronizes all business data to a Hive database, usually by means of data synchronization (Hive is an SQL-like data warehouse tool for storing and analyzing big data). Developers need to separate the data with the service as the dimension, so that the data of each service falls within the scope of that service's tables, and to integrate all data created by the insurance business system.
In the course of processing data, developers face the following requirement: at present, all report or data display platforms need to be developed in Hive to analyze data and then synchronize the data to a service system. Generally, data calculation, analysis and similar operations are performed in the Hive database, and the results are then synchronized to the service system by querying. Some systems, however, also need the metadata (that is, the original data that has not been through any calculation) to be synchronized to the service system, and when the data volume is large, the metadata can reach tens of millions of records.
Currently, meeting this requirement runs into the following problem: the scheme provided by Hive is to query the required data with SQL-like statements and to write a script that takes the month and a preset number of records as dimensions. In practice, however, it is found that when files are divided in an Azkaban scheduled task with 1,000,000 records (100W) as the dimension, a full MapReduce job (a statistical data analysis function) has to be run again for every 1,000,000 records. This makes processing extremely slow: generating the data files for one month takes 2 hours or more, which seriously affects processing efficiency.
In order to solve the above problems, the present invention provides a data partitioning processing method, and it should be noted that in the embodiments of the present application, related data can be acquired and processed based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
As shown in fig. 1, in this embodiment, the data division processing method includes:
S1: stripping data to be processed from Hive according to a preset dimension, and performing format conversion on the data to be processed to form a specific format file;
S2: placing the specific format file in a preset file server through a temporary directory of Hive;
S3: acquiring the specific format file from the file server through a preset front-end service system, and saving the specific format file to a preset service server to form service file data, wherein the preset front-end service system is connected to the file server in advance;
S4: splitting the service file data in the service server to form file groups, and saving the file groups separately to complete the data division processing.
In the embodiment shown in fig. 1, step S1 is the process of stripping the data to be processed from Hive according to a preset dimension and performing format conversion on the data to be processed to form a specific format file, and includes:
S11: retrieving the data to be processed in Hive according to the preset dimension, and marking the data to be processed; the preset dimension is a specific time range, for example, the data of a certain year in Hive is acquired as the data to be processed;
S12: stripping and exporting the data to be processed from Hive through a preset export statement according to the mark; this step strips the data to be processed directly out of Hive, so that, unlike the prior art, splitting inside Hive does not reduce data processing efficiency; for example, select user, company, license, time from dw_trans_detail;
S13: setting a specific format for the data to be processed that has been stripped and exported from Hive;
S14: converting the data to be processed based on the specific format to form a specific format file.
The preset export statement is an SQL statement. The specific format file may be any file type suitable for the file server. Considering that a CSV file places no inherent limit on the number of records it stores and does not lose data, in a specific embodiment of the present invention the specific format is defined as the CSV format. Using files with the .csv suffix therefore avoids a built-in row limit and can effectively increase the total amount of data that can be stored.
Specifically, step S11 is the process of marking data. The marking manner is not limited; in this embodiment, the retrieved data is marked by appending a dot, that is, a dot marker is added after the data. What counts as data to be processed is likewise not limited; in this embodiment, the data required for performing the service operation is taken as the data to be processed;
step S12 is the process of exporting the data to be processed; unlike the prior art, the data is not divided directly in the Hive table, which avoids the prior-art problem that dividing data inside Hive reduces data processing efficiency;
step S13 is the process of setting the preset format; in this embodiment the CSV format is set, that is, the content format of the file is set, including the encoding UTF-8, the field separator ",", the line terminator "\r\n", and the like;
step S14 is the process of converting the data to be processed into a specific format file, that is, converting all of the exported data to be processed into a file in CSV format; in this embodiment, the statement is:
insert overwrite local directory '/tmp/out/'
row format delimited fields terminated by ','
select user, company, license, time from dw_trans_detail;
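The format settings of step S13 and the conversion of step S14 can also be illustrated outside Hive. The following is a minimal sketch in Python, not the patent's implementation (which relies on the Hive statement above); the function names `rows_to_csv_text` and `write_csv_file` and the sample row are hypothetical. It applies the three settings named in the text: UTF-8 encoding, "," as the field separator, and "\r\n" as the line terminator.

```python
import csv
import io


def rows_to_csv_text(rows, header):
    """Serialize rows as CSV text with ',' as separator and '\\r\\n' line endings."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, delimiter=",", lineterminator="\r\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def write_csv_file(path, rows, header):
    """Write the CSV text to disk encoded as UTF-8 (hypothetical helper)."""
    # newline="" stops Python from translating the "\r\n" already in the text.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(rows_to_csv_text(rows, header))


if __name__ == "__main__":
    # Sample row is made up for illustration only.
    sample = [("u1", "acme", "L-1", "2021-01-05")]
    print(repr(rows_to_csv_text(sample, ["user", "company", "license", "time"])))
```

Opening the file with `newline=""` matters on platforms that would otherwise rewrite line endings, since the separator and terminator are part of the agreed file format.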
In the embodiment shown in fig. 1, step S2 is the process of placing the specific format file in a preset file server through the temporary directory of Hive, and includes:
S21: dividing the specific format file into a preset number of subfiles; this segments the specific format file so that the subfiles obtained by splitting can be handled concurrently, which improves processing efficiency;
S22: collecting the subfiles into a subfile set;
S23: placing the subfile set under the temporary directory of Hive to form a Hive temporary file; that is, the data to be processed is stripped from Hive to form a preset number of subfiles in CSV format, and the subfile set formed from these subfiles is then placed in a temporary directory of Hive; the temporary directory is created temporarily and is not, strictly speaking, part of the original Hive;
S24: sending the Hive temporary file, through a sending instruction, to a file server connected in advance to the temporary directory.
Specifically, the preset number in step S21 is not limited and may be determined according to actual conditions; in this embodiment, the preset number is 10, that is, the specific format file is divided into 10 subfiles;
step S22 is the process of collecting the subfiles; the specific manner is not limited, and in this embodiment, collecting the subfiles into a subfile set includes:
S221: acquiring the specific format file to which the subfiles belong;
S222: assigning a primary signature to the specific format file, and acquiring the order in which the subfiles were divided from the specific format file;
S223: assigning sequence numbers to the subfiles according to that order, and adding the primary signature of the specific format file to each sequence number to form a label name for each subfile;
S224: arranging the subfiles by label name to form a subfile list, and moving the subfile list to a blank folder to form the subfile set;
step S23 is the process of forming the Hive temporary file, that is, the subfile set is placed in the temporary directory, from which it can be sent from Hive to a file server by a sending command; in this embodiment, the file server is an SFTP file server;
step S24 is the process of sending the Hive temporary file to the file server connected in advance to the temporary directory. Before step S24, a connection is established between the Hive server and the file server; once the connection is established, the temporary file can be transmitted to the file server through the temporary directory. In other words, the subfile set stays under the temporary directory only briefly: the temporary directory acts as an intermediary between Hive and the file server, and the data stripped from Hive is transmitted to the file server immediately after this short temporary storage, so that the subsequent data division can be carried out through the file server. This addresses the problem that conventional data division performs data calculation, analysis and similar operations in the Hive database and synchronizes the results to the service system by querying, which is slow, time-consuming, and seriously affects efficiency.
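Steps S21 and S221 to S224 can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the patent's implementation: the function name `split_into_subfiles` is hypothetical, the file stem stands in for the "primary signature", sequence numbers are zero-padded so that sorting by label name preserves the division order, and every record is assumed to occupy exactly one line (no quoted line breaks inside CSV fields). The SFTP transfer of step S24 is not shown.

```python
import math
from pathlib import Path


def split_into_subfiles(source, out_dir, count=10):
    """Split a CSV file into at most `count` subfiles named <stem>_<seq>.csv."""
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # the "blank folder" of S224
    # For brevity the whole file is read at once; a production version
    # would stream it instead.
    with open(source, "r", encoding="utf-8", newline="") as handle:
        lines = handle.readlines()
    if not lines:
        return []
    per_file = math.ceil(len(lines) / count)
    primary = source.stem          # assumed primary signature: the file stem
    width = len(str(count))        # zero-padding keeps name order = split order
    paths = []
    for seq, start in enumerate(range(0, len(lines), per_file), start=1):
        target = out_dir / f"{primary}_{seq:0{width}d}.csv"
        with open(target, "w", encoding="utf-8", newline="") as out:
            out.writelines(lines[start:start + per_file])
        paths.append(target)
    return paths
```

Because the label name carries both the primary signature and the sequence number, a receiver can tell which specific format file a subfile belongs to and where it sits in the original order without any extra index file.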
In the embodiment shown in fig. 1, step S3 is the process of acquiring the specific format file from the file server through a preset front-end service system and saving the specific format file to a preset service server to form service file data, and includes:
S31: connecting the preset front-end service system to the file server; the front-end service system is an independent system used for reading files in the file server; during connection, the front-end service system sends a connection request to the file server; on receiving the request, the file server first reads the request address carried by the connection request and judges from the request address whether the connection request meets the connection requirement; if it does, the file server sends a connection receipt to the front-end service system and opens a connection port for the front-end service system;
S32: acquiring the subfile set in the file server through the front-end service system;
S33: reassembling the subfile set to restore it into the specific format file;
S34: transmitting the specific format file to a service server connected to the file server; the service server is used for executing the related services according to the specific format file; this step transmits the required data to be processed, in the specific format, to the service system;
after the specific format file is transmitted to the service server connected to the file server, the method further includes:
a process of storing the specific format file to form a service file, including:
S351: connecting an external service database to the service server; in this embodiment, the service database is a NAS disk;
S352: storing the specific format files into the external service database in batches through a dump plug-in to form service files.
Specifically, step S31 is the process of connecting the front-end service system to the file server in advance. The front-end service system is set up first and sends a connection request to the file server; the file server receives the request, reads the request address carried by the connection request, and judges from the request address whether the connection request meets the connection requirement. If it does not, the file server sends a connection-unavailable receipt to the front-end service system, and the front-end service system, acting on the receipt, continues to send connection requests to the file server until the connection requirement is met;
step S32 is the process of acquiring the subfile set in the file server through the front-end service system; when acquiring the subfile set, a reading plug-in in the front-end service system first obtains a character string for the subfile set over the transmission channel formed between the front-end service system and the file server, then obtains the subfile data according to the character string, and forms the subfile set from the subfile data;
step S33 is the process of reassembling the subfile set to restore it into the specific format file; in more detail, the subfile set was produced by dividing the specific format file, so in this step the subfiles in the subfile set are reassembled and merged to form a reassembled file, and the reassembled file is format-converted to restore the specific format file;
step S34 is the step of transmitting the specific format file to the service server; the service server is the server that processes data in the insurance system to form various service data, and this step transmits the required data to be processed, in the specific format, to the service system;
in this embodiment, the service database is a NAS disk, a hardware device connected to the service server, which provides the service server with large capacity for storing a large number of specific format files;
in addition, the files on the file server need to be cleaned up regularly: although the capacity of the file server is large, long-term accumulation can exhaust its storage space, so unneeded data must be cleaned up regularly to release space.
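The reassembly of step S33 can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation: the function name `restore_specific_format_file` is hypothetical, and it assumes that the subfiles carry zero-padded sequence numbers in their label names, so that sorting by name reproduces the original division order. Fetching the subfiles from the SFTP server (step S32) is not shown.

```python
from pathlib import Path


def restore_specific_format_file(subfile_dir, target):
    """Concatenate subfiles in label-name order back into one CSV file.

    `target` should live outside `subfile_dir` so that it is never picked
    up as a subfile on a later run.
    """
    parts = sorted(Path(subfile_dir).glob("*.csv"), key=lambda p: p.name)
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as handle:
                # Copy in fixed-size chunks so memory use stays bounded.
                for chunk in iter(lambda: handle.read(64 * 1024), b""):
                    out.write(chunk)
    return len(parts)
```

Copying bytes rather than decoded text leaves the encoding, separators and line terminators of the specific format untouched.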
In the embodiment shown in fig. 1, step S4 is the process of splitting the service file data in the service server to form file groups and saving the file groups separately to complete the data division processing, and includes:
S41: traversing and reading the service file to form service data;
S42: adding a time identifier to each piece of service data according to a time field preset in the service database;
S43: classifying and summarizing the service data according to the time identifier through a preset data retrieval program, and mapping the service data of each category to the data retrieval command corresponding to that category; the time identifier includes year, month, day, hour and minute, that is, the data can be classified by the same year, the same month, the same day and so on, which completes the classification part of the data division;
S44: generating a processing file from the service data corresponding to the data retrieval command;
S45: splitting the processing file according to a preset quantity to form file groups; the preset quantity can be any number, for example x × 10,000 records; that is, if the processing file for a month contains 10,000,000 records, it is divided into 1000 ÷ x file groups with x × 10,000 records as the dimension;
in the splitting process, a Java program splits the processing file in units of x × 10,000 records; a mark value is initialized before splitting, and each time a row of the processing file is written into the current file group, the mark value is incremented by 1; when the mark value reaches x × 10,000, a new file group is created, and this cycle repeats until the whole processing file has been split;
S46: saving the file groups separately to complete the data division processing;
Specifically, step S41 reads the service file to form service data: the service file is first traversed sequentially, data is extracted from it during the traversal to form big data, and the big data is arranged in order to form the service data;
in step S42, a time field is set in the service database in advance, so that whenever a piece of service data is entered into the service database, a time identifier is automatically assigned to it;
step S43 classifies and summarizes the service data according to the time identifier. In this embodiment, the data retrieval program uses one month as the unit category: service data from the same month is summarized together, so each month is one category, and the service data of each month is associated with the data retrieval command for that month. The data retrieval commands are also preset in advance and are used to summarize service data of the same category (the same month);
step S44 generates a processing file from the service data; the processing file is the final version of the file before splitting, that is, data processing is completed simply by splitting this file and storing the result;
step S45 splits the processing file according to a preset number to form file groups. In this embodiment, the preset number is 1 million; that is, if the processing file for one month contains 8 million records, it is split into 8 file groups of 1 million records each;
in the splitting process, a Java program splits the processing file into groups of 1 million records: a counter is set to an initial value before splitting and is incremented by 1 each time a row of the processing file is written into the current file group; when the counter reaches 1 million, that file group is complete and a new file group is created, and this cycle repeats until the whole processing file has been split.
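The grouping of step S43 and the counter-driven split of step S45 can be sketched in Java as follows. This is only an illustrative sketch under stated assumptions, not the implementation used in the embodiment: the class name `MonthlySplitSketch`, the `ServiceRow` type and both method names are hypothetical, and rows are held in memory purely to keep the example short.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MonthlySplitSketch {

    /** Hypothetical row of service data carrying its time identifier. */
    static final class ServiceRow {
        final LocalDate time;
        final String payload;

        ServiceRow(LocalDate time, String payload) {
            this.time = time;
            this.payload = payload;
        }
    }

    /** Step S43 (sketch): one category per calendar month, months kept in order. */
    static Map<YearMonth, List<ServiceRow>> groupByMonth(List<ServiceRow> rows) {
        Map<YearMonth, List<ServiceRow>> byMonth = new TreeMap<>();
        for (ServiceRow row : rows) {
            byMonth.computeIfAbsent(YearMonth.from(row.time), k -> new ArrayList<>()).add(row);
        }
        return byMonth;
    }

    /** Step S45 (sketch): split rows into file groups of at most groupSize rows. */
    static <T> List<List<T>> splitIntoGroups(List<T> rows, int groupSize) {
        if (groupSize <= 0) {
            throw new IllegalArgumentException("groupSize must be positive");
        }
        List<List<T>> groups = new ArrayList<>();
        List<T> current = new ArrayList<>();
        int counter = 0; // the "mark value": reset at the start of each cycle
        for (T row : rows) {
            current.add(row);
            counter++;
            if (counter == groupSize) { // current group is full, start a new one
                groups.add(current);
                current = new ArrayList<>();
                counter = 0;
            }
        }
        if (!current.isEmpty()) { // last, partially filled group
            groups.add(current);
        }
        return groups;
    }
}
```

With a group size of 1 million, a month containing 8 million rows yields 8 groups; a month whose row count is not an exact multiple of the group size yields one extra, smaller group at the end.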
In addition, it should be noted that when the service server reads a file with millions of rows, the data is first read into the memory space of the JVM, and such a large amount of data generally exhausts the JVM memory. In this embodiment, therefore, the data is not read all at once during splitting: one line is read from the file at a time, and after the current line has been processed its memory is released before the next line is read. In this embodiment, the processing file is handled with the EasyExcel tool provided by Alibaba, which processes large data files efficiently. Moreover, for extensibility and compatibility, if processing logic for a new file is added later, no new set of logic needs to be developed; only the table, fields and service type need to be extracted as configuration to meet the extension requirement. When a processing file is abnormal, the execution of the other files is not affected; the abnormal file is recorded in a preset exception log, which is kept in a database table so that developers can investigate problems promptly.
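The line-at-a-time reading and the per-file fault isolation described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses plain `java.nio` text files rather than the EasyExcel tool named in the embodiment, the class name `StreamingFileSplitter`, the method names and the `.partN` naming scheme are hypothetical, and the exception log is an in-memory list standing in for the database table.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class StreamingFileSplitter {

    /**
     * Splits source into part files of at most linesPerPart lines each.
     * Only one line is held in memory at a time.
     */
    static List<Path> split(Path source, Path outputDir, int linesPerPart) throws IOException {
        if (linesPerPart <= 0) {
            throw new IllegalArgumentException("linesPerPart must be positive");
        }
        Files.createDirectories(outputDir);
        List<Path> parts = new ArrayList<>();
        BufferedWriter writer = null;
        int counter = 0; // lines written to the current part
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (writer == null) { // open the next part lazily
                    Path part = outputDir.resolve(source.getFileName() + ".part" + (parts.size() + 1));
                    parts.add(part);
                    writer = Files.newBufferedWriter(part, StandardCharsets.UTF_8);
                }
                writer.write(line);
                writer.newLine();
                counter++;
                if (counter == linesPerPart) { // part is full: close it and reset the counter
                    writer.close();
                    writer = null;
                    counter = 0;
                }
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return parts;
    }

    /** Splits each file independently; a failing file is logged and the rest still run. */
    static List<String> splitAll(List<Path> sources, Path outputDir, int linesPerPart) {
        List<String> exceptionLog = new ArrayList<>();
        for (Path source : sources) {
            try {
                split(source, outputDir, linesPerPart);
            } catch (IOException | RuntimeException e) {
                exceptionLog.add(source + ": " + e);
            }
        }
        return exceptionLog;
    }
}
```

Opening each part lazily avoids creating an empty trailing part when the line count is an exact multiple of `linesPerPart`.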
As described above, in the data partitioning processing method provided by the present invention, the data to be processed is first stripped out in Hive according to a preset dimension and converted into a specific format file; the specific format file is then placed in a preset file server through the temporary directory of Hive; next, the specific format file is obtained from the file server through a preset front-end service system and stored in a preset service server to form service file data. In this embodiment, processing small batches of data through the front-end service system and the file server relieves pressure on Hive, so that Hive can concentrate on processing genuinely "big" data, which greatly improves data processing efficiency and saves manpower and time.
As described above, in the embodiment shown in fig. 1, the data partitioning processing method provided by the present invention has the following advantages. First, analyzing data with MapReduce in Hive is very resource-intensive; if the data volume is small, repeated executions still occupy MapReduce and affect other tasks, whereas processing small batches of data through the front-end service system and the file server relieves pressure on Hive, so that Hive can process genuinely "big" data. Second, data processing efficiency is improved: practice shows that the traditional technique takes more than 2 hours to process one month of data on Hive, whereas this embodiment delegates part of the logic to the front-end service system, so that Hive processes the full data set in about 10 minutes and the front-end service system splits the file in about 5 minutes, improving processing efficiency by 75%. Third, processing time is shortened and labor is saved.
As shown in fig. 2, the present invention provides an efficient data partitioning processing apparatus 100, which can be installed in an electronic device. According to the functions implemented, the efficient data partitioning processing apparatus 100 may include a format specific unit 101, a file external unit 102, a service file unit 103, and a data dividing unit 104. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the format specific unit 101 is configured to strip data to be processed in Hive according to a preset dimension and to convert the format of the data to be processed to form a specific format file;
the file external unit 102 is configured to place the specific format file in a preset file server through the temporary directory of Hive;
the service file unit 103 is configured to acquire the specific format file from the file server through a preset front-end service system and to store the specific format file in a preset service server to form service file data, wherein the preset front-end service system is pre-connected with the file server;
the data dividing unit 104 is configured to split the service file data in the service server to form file groups and to store the file groups separately to complete the data division processing.
For a specific implementation method, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.
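The way the four units hand their results to one another can be sketched as a simple pipeline. This is only an architectural illustration under stated assumptions: the class name `PartitioningApparatusSketch`, the use of `String` values as stand-ins for files and locations, and the `run` method are all hypothetical, and the real units would talk to Hive, the file server and the service server rather than transform strings.

```java
import java.util.List;
import java.util.function.Function;

class PartitioningApparatusSketch {

    private final Function<String, String> formatSpecificUnit;     // unit 101: dimension -> specific format file
    private final Function<String, String> fileExternalUnit;       // unit 102: file -> location on the file server
    private final Function<String, String> serviceFileUnit;        // unit 103: location -> service file data
    private final Function<String, List<String>> dataDividingUnit; // unit 104: service file data -> file groups

    PartitioningApparatusSketch(Function<String, String> formatSpecificUnit,
                                Function<String, String> fileExternalUnit,
                                Function<String, String> serviceFileUnit,
                                Function<String, List<String>> dataDividingUnit) {
        this.formatSpecificUnit = formatSpecificUnit;
        this.fileExternalUnit = fileExternalUnit;
        this.serviceFileUnit = serviceFileUnit;
        this.dataDividingUnit = dataDividingUnit;
    }

    /** Runs the four units in order, each consuming the previous unit's output. */
    List<String> run(String presetDimension) {
        return formatSpecificUnit
                .andThen(fileExternalUnit)
                .andThen(serviceFileUnit)
                .andThen(dataDividingUnit)
                .apply(presetDimension);
    }
}
```

Keeping each unit behind its own function boundary mirrors the statement that the division into modules is logical: any unit can be replaced without touching the others.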
As described above, the efficient data partitioning processing apparatus provided by the present invention first strips the data to be processed in Hive according to the preset dimension through the format specific unit 101 and converts it into a specific format file; the file external unit 102 then places the specific format file in the preset file server through the temporary directory of Hive; the service file unit 103 obtains the specific format file from the file server through the preset front-end service system, which is pre-connected with the file server, and stores it in the preset service server to form service file data; finally, the data dividing unit 104 splits the service file data in the service server to form file groups and stores the file groups separately to complete the data division processing. In this embodiment, processing small batches of data through the front-end service system and the file server relieves pressure on Hive, so that Hive can concentrate on processing genuinely "big" data, which greatly improves data processing efficiency and saves manpower and time.
In the embodiment shown in fig. 2, the efficient data partitioning processing apparatus provided by the present invention has the following advantages. First, analyzing data with MapReduce in Hive is very resource-intensive; if the data volume is small, repeated executions still occupy MapReduce and affect other tasks, whereas processing small batches of data through the front-end service system and the file server relieves pressure on Hive, so that Hive can process genuinely "big" data. Second, data processing efficiency is improved: practice shows that the traditional technique takes more than 2 hours to process one month of data on Hive, whereas this embodiment delegates part of the logic to the front-end service system, so that Hive processes the full data set in about 10 minutes and the front-end service system splits the file in about 5 minutes, improving processing efficiency by 75%. Third, processing time is shortened and labor is saved.
As shown in fig. 3, the present invention provides an electronic device 1 implementing a data division processing method.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as an efficient data partitioning handler 12, stored in the memory 11 and executable on said processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes for efficient data division processing, etc., but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., efficient data partitioning process programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may include a network interface, which may optionally include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface) and is generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The efficient data partitioning handler 12 stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, enable:
stripping data to be processed in Hive according to a preset dimension, and performing format conversion on the data to be processed to form a specific format file;
placing the specific format file in a preset file server through the temporary directory of Hive;
acquiring the specific format file from the file server through a preset front-end service system, and storing the specific format file in a preset service server to form service file data, wherein the preset front-end service system is pre-connected with the file server;
and splitting the service file data in the service server to form file groups, and storing the file groups separately to complete the data division processing.
Specifically, for the way in which the processor 10 implements these instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated here. It should be emphasized that, to further ensure the privacy and security of the data of the above efficient data partitioning process, this data is stored in a node of the blockchain where the server cluster is located.
The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable medium may include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
An embodiment of the present invention further provides a computer-readable storage medium, where the storage medium may be nonvolatile or volatile, and the storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements:
stripping data to be processed in Hive according to a preset dimension, and performing format conversion on the data to be processed to form a specific format file;
placing the specific format file in a preset file server through the temporary directory of Hive;
acquiring the specific format file from the file server through a preset front-end service system, and storing the specific format file in a preset service server to form service file data, wherein the preset front-end service system is pre-connected with the file server;
and splitting the service file data in the service server to form file groups, and storing the file groups separately to complete the data division processing.
Specifically, the specific implementation method of the computer program when being executed by the processor may refer to the description of the relevant steps in the data partitioning processing method in the embodiment, which is not repeated herein.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical functional division, and other divisions may be used in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A data partitioning processing method is characterized by comprising the following steps:
stripping data to be processed in Hive according to a preset dimension, and performing format conversion on the data to be processed to form a specific format file;
placing the specific format file in a preset file server through the temporary directory of Hive;
acquiring the specific format file from the file server through a preset front-end service system, and storing the specific format file in a preset service server to form service file data, wherein the preset front-end service system is pre-connected with the file server;
and splitting the service file data in the service server to form file groups, and storing the file groups separately to complete the data division processing.
2. The data partitioning processing method according to claim 1, wherein the stripping data to be processed in Hive according to a preset dimension and performing format conversion on the data to be processed to form a specific format file comprises:
retrieving the data to be processed in Hive according to the preset dimension, and marking the data to be processed;
according to the mark, stripping and exporting the data to be processed from Hive through a preset export statement;
setting a specific format for the data to be processed that has been stripped and exported from Hive;
and converting the format of the data to be processed based on the specific format to form the specific format file.
3. The data partitioning processing method according to claim 1, wherein the placing the specific format file in a preset file server through the temporary directory of Hive comprises:
dividing the specific format file into a preset number of subfiles;
collecting the subfiles into a subfile set;
placing the subfile set under the temporary directory of Hive to form a Hive temporary file;
and sending the Hive temporary file, through a sending instruction, to a file server pre-connected with the temporary directory.
4. The data partitioning processing method according to claim 3, wherein the collecting the subfiles into a subfile set comprises:
acquiring the specific format file to which the subfiles belong;
assigning a primary signature to the specific format file, and acquiring the order in which the subfiles were divided from the specific format file;
assigning a sequence number to each subfile according to that order, and adding the primary signature of the specific format file to the sequence number to form a label name for each subfile;
and arranging the subfiles according to their label names to form a subfile list, and moving the subfile list to a blank folder to form the subfile set.
5. The data partitioning processing method according to claim 4, wherein the acquiring the specific format file from the file server through a preset front-end service system and storing the specific format file in a preset service server to form service file data comprises:
connecting the preset front-end service system with the file server, wherein the front-end service system is an independent system used for reading files in the file server;
acquiring the subfile set from the file server through the front-end service system;
reassembling the subfile set to restore it into the specific format file;
and transmitting the specific format file to a service server connected with the file server.
6. The data partitioning processing method according to claim 5, further comprising storing the specific format file to form a service file, wherein the storing the specific format file to form a service file comprises:
connecting the service server to an external service database;
and storing the specific format files into the external service database in batches through a dump plug-in to form the service file.
7. The data partitioning processing method according to claim 1, wherein the splitting the service file data in the service server to form file groups and storing the file groups separately to complete the data division processing comprises:
traversing and reading the service file to form service data;
adding a time identifier to each piece of service data according to a time field preset in the service database;
classifying and summarizing the service data according to the time identifier through a preset data retrieval program, and associating the service data of each category with the data retrieval command corresponding to that category;
generating a processing file from the service data corresponding to each data retrieval command;
splitting the processing file according to a preset number to form file groups;
and storing the file groups separately to complete the data division processing.
8. A data partitioning processing apparatus, characterized in that the apparatus comprises:
a format specific unit, configured to strip data to be processed in Hive according to a preset dimension and to convert the format of the data to be processed to form a specific format file;
a file external unit, configured to place the specific format file in a preset file server through the temporary directory of Hive;
a service file unit, configured to acquire the specific format file from the file server through a preset front-end service system and to store the specific format file in a preset service server to form service file data, wherein the preset front-end service system is pre-connected with the file server;
and a data dividing unit, configured to split the service file data in the service server to form file groups and to store the file groups separately to complete the data division processing.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the steps of the data partitioning processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements a data partitioning processing method according to any one of claims 1 to 7.
CN202111092757.4A 2021-09-17 2021-09-17 Data division processing method and device, electronic equipment and storage medium Pending CN113806451A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111092757.4A CN113806451A (en) 2021-09-17 2021-09-17 Data division processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113806451A true CN113806451A (en) 2021-12-17

Family

ID=78939608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111092757.4A Pending CN113806451A (en) 2021-09-17 2021-09-17 Data division processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113806451A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778148A (en) * 2012-10-23 2014-05-07 阿里巴巴集团控股有限公司 Life cycle management method and equipment for data file of Hadoop distributed file system
JP2016218540A (en) * 2015-05-15 2016-12-22 日本電気株式会社 Data conversion device, data conversion system, data conversion method, and data conversion program
CN110569298A (en) * 2019-09-12 2019-12-13 成都中科大旗软件股份有限公司 data docking and visualization method and system
CN110866006A (en) * 2019-10-12 2020-03-06 苏宁云计算有限公司 Method and device for archiving expired data
CN110968582A (en) * 2019-11-01 2020-04-07 苏宁云计算有限公司 Crowd generation method and device
CN111737235A (en) * 2020-08-12 2020-10-02 国网浙江省电力有限公司营销服务中心 Heterogeneous data migration method for power industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination