CN109739818B - Portable high-throughput big data acquisition method and system - Google Patents

Portable high-throughput big data acquisition method and system

Info

Publication number: CN109739818B (application CN201811622024.5A; earlier published as CN109739818A)
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: data, server, data collection, file, database
Inventor: 张晨光 (Zhang Chenguang)
Original and current assignee: Inspur Software Co Ltd (the listed assignee may be inaccurate; Google has not performed a legal analysis)
Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Application filed by Inspur Software Co Ltd
Abstract

The invention discloses a portable high-throughput big data acquisition method and system, belonging to the field of data acquisition and aiming to solve the technical problem of instantly collecting and processing data of different data structures from different databases. The technical scheme is as follows: in the portable high-throughput big data acquisition method, a central server sends an instruction, each cluster server starts Logstash, configuration parameters are read through the etcd component of datatrains, and the corresponding configuration file is automatically generated; Logstash reads the configuration file and collects data from various databases accordingly; after sorting, the database data are sent to each server in message form through Kafka, and Kafka and Logstash cooperate at the consumption end to temporarily store the collected data in the corresponding format; the central server then calls the related components to process the collected data. The invention also discloses a portable high-throughput big data acquisition system.

Description

Portable high-throughput big data acquisition method and system
Technical Field
The invention relates to the field of data acquisition, in particular to a portable high-throughput big data acquisition method and system.
Background
Information is the core basis of decision making, so timely and effective data acquisition and data processing are very important. Because data are dispersed and their structures are inconsistent, the traditional data collection mode is very time-consuming and labor-intensive. Therefore, how to instantly collect and process data of different data structures from different databases is a technical problem that urgently needs to be solved.
To solve the above technical problem, attention has turned to enterprise data integration research: data in different systems are reprocessed to form an integrated, analysis-oriented environment, so that rules can be mined from massive information, knowledge can be extracted, and decision making can be assisted. In the prior art, data is extracted by polling the database in a single thread, but because the data volume of some tables is very large, polling each table in a single thread may be excessively time-consuming and make data extraction inefficient.
Patent No. CN106330963A discloses a cross-network multi-node log collection method in which node logs are sent to a headquarters: an application server sends each day's node logs to the headquarters over the external network, the server regularly executes shell scripts every day, and the log files on the server are compressed and sent to the headquarters server; the log data are then sent from the headquarters extranet to the headquarters intranet, where a log ferrying program stores the logs into an intranet database; finally, the log data in the headquarters intranet database are restored to the original log files and sent to a big data platform through the log management tool Logstash. However, this technical scheme cannot instantly collect and process data of different data structures from different databases.
Disclosure of Invention
The technical task of the invention is to provide a portable high-throughput big data acquisition method and system, so as to solve the problem of how to instantly collect and process data of different data structures from different databases.
The technical task of the invention is realized as follows. In the portable high-throughput big data acquisition method, a central server sends an instruction, each cluster server starts Logstash, configuration parameters are read through the etcd component of datatrains, and the corresponding configuration file is automatically generated; Logstash reads the configuration file and collects data from various databases accordingly; after sorting, the database data are sent to each server in message form through Kafka, and Kafka and Logstash cooperate at the consumption end to temporarily store the collected data in the corresponding format; the central server then calls the related components to process the collected data. The method comprises the following specific steps:
S1, the server end generates different parameter message packets according to different service requirements and sends them through a timed task or by manual operation;
S2, after receiving the message packet sent by the server end, the consumption end parses it to obtain the corresponding parameters; data collection is then carried out according to the parameters and completed through a dtp transmission channel;
S3, the collected data are written into a csv file, and the csv file is compressed and encrypted to produce a data collection compressed packet;
S4, the consumption end uploads the data collection compressed packet to the server end via FTP or SFTP (the two transmission modes are provided mainly for system compatibility);
S5, the server end decrypts and decompresses the data collection compressed packet to obtain the decompressed csv file;
S6, judging whether the decompressed csv file belongs to a common table or a large table:
(1) if it is a common table, normal data insertion and deletion are carried out;
(2) if it is a large table, gpload is used to insert the data.
Preferably, in step S6, gpload is used for data insertion; the specific method is as follows:
(1) Java controls shell instructions on the local Linux server; the java.lang.Runtime class encapsulates the runtime environment;
(2) each Java application has a single Runtime instance, which connects the application with its running environment;
(3) a reference to the current Runtime object is obtained through the Runtime.getRuntime() method;
(4) with the reference to the current Runtime object, the methods of the Runtime object are called to control the state and behavior of the Java virtual machine.
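The steps above can be sketched as follows. This is a minimal, hypothetical sketch: it obtains the current Runtime reference and runs a shell command, reading back its standard output; the real command would be a gpload invocation (for example "gpload -f job.yml"), which the patent does not spell out, so a harmless echo stands in here.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class GploadRunner {
    // Runs a shell command on the local Linux server via the JVM's Runtime
    // and returns its standard output. In the patent's scheme the command
    // would be a gpload invocation (assumed, e.g. "gpload -f job.yml").
    static String runShell(String command) throws Exception {
        Runtime rt = Runtime.getRuntime();                  // step (3): current Runtime reference
        Process p = rt.exec(new String[]{"/bin/sh", "-c", command}); // step (1): shell instruction
        StringBuilder out = new StringBuilder();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        p.waitFor();                                        // step (4): control process state
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(runShell("echo gpload-ok"));
    }
}
```

In practice the exit code of waitFor() would also be checked to decide whether the gpload load succeeded.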
Preferably, the data extraction in the data collection process of step S2 is completed by paged polling with select queries, and each table in the database is polled by multiple threads during data collection.
Preferably, when data are extracted in the data collection process, the query for each table in the database is decomposed into sql statements according to the index hit rules, and this index-hit processing improves data extraction efficiency.
Preferably, the multithreaded selection method is specified as follows:
(1) transmission control parameters: the thread number and the page size control the stability of the system, and the thread number is not less than 2;
(2) number of threads to open: the thread number determines the system resources occupied by data extraction, including the number of db connections and system io resources; the range of the thread number is: 1 < thread_num < 10;
(3) query page size: the page size determines the heap space and io resources occupied; heap exhaustion is avoided by controlling these two parameters; the range of the page size is: 10000 < page < 100000;
(4) operation and maintenance personnel actively adjust the performance parameters of the corresponding server as required.
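The paged polling behind these parameters can be sketched as follows: one table extraction is split into paged SELECT statements whose size respects the page bound, so that a pool of worker threads (1 < thread_num < 10 per the text) can each execute a page. The table name, ordering column and row count are illustrative assumptions; the patent does not give the exact sql form.

```java
import java.util.ArrayList;
import java.util.List;

public class PagedPoller {
    // Splits one table extraction into paged SELECT statements; each page can
    // then be handed to a separate worker thread. Table/column names are
    // illustrative, not taken from the patent.
    static List<String> pageQueries(String table, long rowCount, int pageSize) {
        List<String> queries = new ArrayList<>();
        for (long offset = 0; offset < rowCount; offset += pageSize) {
            queries.add("SELECT * FROM " + table
                    + " ORDER BY id LIMIT " + pageSize + " OFFSET " + offset);
        }
        return queries;
    }

    public static void main(String[] args) {
        // 120000 rows with page = 50000 (inside the 10000 < page < 100000 bound)
        // yield three paged queries at offsets 0, 50000 and 100000.
        for (String q : pageQueries("t_order", 120000, 50000)) {
            System.out.println(q);
        }
    }
}
```

In a full implementation the queries would be submitted to a fixed-size ExecutorService whose pool size is the thread_num parameter.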
Preferably, a Memory Leak or Memory Overflow occurs after the occupied heap reaches the capacity limit of the maximum heap; the two cases are determined as follows:
(1) if memory leaks, use a tool to inspect the reference chain from the leaked objects to the GC Roots and find the path through which the leaked objects remain reachable from the GC Roots, which prevents the garbage collector from reclaiming them automatically;
(2) if memory overflows, the objects in memory are genuinely all alive; check whether the virtual machine heap parameters (-Xmx and -Xms) can be increased relative to the physical memory of the machine, check in the code whether some objects have too long a life cycle or hold state for too long, and try to reduce the memory consumption of the program at runtime.
Preferably, the central server calls the related components to process the collected data; the specific steps are as follows:
(i) start the jar and poll the 242 server every 2 minutes;
(ii) acquire the server file information through session.getStdout();
(iii) judge whether the file transmission is complete: if complete, pull the compressed file from the remote server through ch.ethz.ssh2;
(iv) delete the intermediate file on the 242 server through ssh;
(v) decrypt and decompress the file;
(vi) resolve the related parameters from the file name;
(vii) delete duplicate data from the GreenPlum database;
(viii) import the csv file into the GP database;
(ix) end.
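Step (vi) resolves the collection parameters from the uploaded file name. A minimal sketch, assuming a hypothetical naming convention "source-table-yyyyMMdd.csv.zip" (the patent does not specify the actual naming scheme):

```java
public class CollectFileName {
    // Parses collection parameters out of an uploaded file name, as in step (vi).
    // The convention "source-table-yyyyMMdd.csv.zip" is an assumption made for
    // illustration only.
    static String[] parseParams(String fileName) {
        String base = fileName.replaceFirst("\\.csv\\.zip$", "");
        return base.split("-");
    }

    public static void main(String[] args) {
        String[] p = parseParams("provincial-t_order-20181228.csv.zip");
        // prints: source=provincial table=t_order date=20181228
        System.out.println("source=" + p[0] + " table=" + p[1] + " date=" + p[2]);
    }
}
```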
A portable high-throughput big data acquisition system comprises a server end and a consumption end;
the server end is used to generate different parameter message packets according to different service requirements, send them through a timed task or by manual operation, and decrypt and decompress the data collection compressed packet to obtain the decompressed csv file;
the consumption end is used to parse the message packet after receiving it from the server end, obtain the corresponding parameters, perform data collection according to the parameters through a dtp transmission channel, write the collected data into a csv file, compress and encrypt the csv file to produce a data collection compressed packet, and upload the packet to the server end via FTP or SFTP (the two transmission modes are provided mainly for system compatibility).
The portable high-throughput big data acquisition method and system have the following advantages:
(I) the invention uses open-source middleware and data collection techniques to remotely manage multiple server clusters, so that the central server controls multiple distributed servers to collect and transmit data from various data sources; all servers cooperatively complete the operation task, the central server acquires the operation information of the related servers and controls and monitors all servers in real time, the servers are remotely controlled with open-source middleware technology, and the data of all servers are extracted, cleaned, converted, stored, compressed and encrypted before being put into the data collection library, so that relational database data can be conveniently and rapidly extracted and output in the required format;
(II) the central service control flow can be initiated in batches or at fixed points; portability and high efficiency are achieved through open-source components such as kafka, realizing convenient and fast high-throughput data acquisition;
(III) operation convenience: configuration parameters are read through the etcd component of datatrains and the corresponding configuration file is automatically generated; Logstash reads the configuration file, and data collection and sending are completed automatically;
(IV) operation performance: theoretical and practical tests show that the kafka cluster has a performance advantage under large data volumes, as shown in the following table:
Time consumption of kafka under large data volumes:

Transmission environment   Total time (hours)   Total sql query time (hours)   kafka backlog (peak)   Data volume
kafka single instance      10.5                 1.5                            within 3000            Node 1 (3.3G)
kafka cluster              13.5                 0.5                            within 3000            Node 2 (3.3G)
kafka cluster              12.7                 1.4                            within 3000            Node 3 (6.1G)
(V) start-stop operation: the timed task is executed cyclically.
According to the invention, kafka and logstash cooperate to temporarily store data at the consumption end in the corresponding format, and the central server calls the related components to process the collected data, so that even non-software developers in charge of data collection can easily operate it, greatly reducing the cost and time of data collection.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a flow diagram of the present invention;
FIG. 2 is a schematic structural view of example 4.
Detailed Description
The portable high-throughput big data acquisition method and system of the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Example 1:
In the portable high-throughput big data acquisition method of the invention, a central server sends an instruction, each cluster server starts Logstash, reads configuration parameters through the etcd component of datatrains, and automatically generates the corresponding configuration file; Logstash reads the configuration file and collects data from various databases accordingly; after sorting, the data are sent to each server in message form through Kafka, and Kafka and Logstash cooperate at the consumption end to temporarily store the collected data in the corresponding format; the central server then calls the related components to process the collected data. The method comprises the following specific steps:
S1, the server end generates different parameter message packets according to different service requirements and sends them through a timed task or by manual operation;
S2, after receiving the message packet sent by the server end, the consumption end parses it to obtain the corresponding parameters; data collection is then carried out according to the parameters and completed through a dtp transmission channel. Data extraction during collection is completed by paged polling with select queries, and each table in the database is polled by multiple threads. When data are extracted, the query for each table in the database is decomposed into sql statements according to the index hit rules, which improves data extraction efficiency. The multithreaded selection method is specified as follows:
(1) transmission control parameters: the thread number and the page size control the stability of the system, and the thread number is not less than 2;
(2) number of threads to open: the thread number determines the system resources occupied by data extraction, including the number of db connections and system io resources; the range of the thread number is: 1 < thread_num < 10;
(3) query page size: the page size determines the heap space and io resources occupied; heap exhaustion is avoided by controlling these two parameters; the range of the page size is: 10000 < page < 100000. A Memory Leak or Memory Overflow occurs after the occupied heap reaches the capacity limit of the maximum heap; the two cases are determined as follows:
(a) if memory leaks, use a tool to inspect the reference chain from the leaked objects to the GC Roots and find the path through which the leaked objects remain reachable from the GC Roots, which prevents the garbage collector from reclaiming them automatically;
(b) if memory overflows, the objects in memory are genuinely all alive; check whether the virtual machine heap parameters (-Xmx and -Xms) can be increased relative to the physical memory of the machine, check in the code whether some objects have too long a life cycle or hold state for too long, and try to reduce the memory consumption of the program at runtime.
(4) Operation and maintenance personnel actively adjust the performance parameters of the corresponding server as required.
S3, the collected data are written into a csv file, and the csv file is compressed and encrypted to produce a data collection compressed packet;
S4, the consumption end uploads the data collection compressed packet to the server end via FTP or SFTP (the two transmission modes are provided mainly for system compatibility);
S5, the server end decrypts and decompresses the data collection compressed packet to obtain the decompressed csv file;
S6, judging whether the decompressed csv file belongs to a common table or a large table:
(1) if it is a common table, normal data insertion and deletion are carried out;
(2) if it is a large table, gpload is used to insert the data; the specific method is as follows:
(i) Java controls shell instructions on the local Linux server; the java.lang.Runtime class encapsulates the runtime environment;
(ii) each Java application has a single Runtime instance, which connects the application with its running environment;
(iii) a reference to the current Runtime object is obtained through the Runtime.getRuntime() method;
(iv) with the reference to the current Runtime object, the methods of the Runtime object are called to control the state and behavior of the Java virtual machine.
The code implementing the gpload data insertion scheme is provided in the original publication as image figures and is not reproduced here.
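A gpload load is driven by a YAML control file that names the input csv and the target Greenplum table. The following is a hypothetical sketch of such a job file; all hosts, credentials, paths and table names are placeholders, not values from the patent.

```yaml
# Hypothetical gpload job file for loading a collected csv into Greenplum.
VERSION: 1.0.0.1
DATABASE: collectdb
USER: gpadmin
HOST: gp-master
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        FILE:
          - /data/collect/t_order_20181228.csv
    - FORMAT: csv
    - DELIMITER: ','
    - HEADER: false
  OUTPUT:
    - TABLE: public.t_order
    - MODE: insert
```

The Java Runtime mechanism described above would then execute a command such as "gpload -f job.yml" against this file.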
Example 2:
As shown in fig. 1, the provincial bureau, the national bureau and the 101 server are taken as examples. A task can be started in two ways: (1) by the provincial bureau's local timed task; (2) by receiving mq scheduling from the national bureau.
The specific steps by which the provincial bureau starts the data acquisition task are as follows:
(1) the provincial bureau starts a data collection task through a timed task or by receiving mq scheduling from the national bureau;
(2) the data are compressed and encrypted;
(3) a message is sent to the central server;
(4) a timer polls every five minutes, requesting transmission permission from the national bureau;
(5) whether national bureau transmission permission has been obtained:
if permission is obtained, the data are uploaded by Ftp to the 242 storage server of the national bureau;
(6) judging whether the upload succeeded:
if the upload succeeded, an mq message is sent to the national bureau; after receiving the mq message, the national bureau modifies the file state and feeds it back to the 242 storage server.
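The permission polling of step (4) can be sketched with a scheduled task. This is a hypothetical sketch: the five-minute interval is shortened so the demo finishes quickly, and the permission predicate stands in for the real mq request to the national bureau, which the patent does not detail.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.IntPredicate;

public class TransmitPoller {
    // Polls for transmission permission at a fixed interval and stops once
    // permission is granted; returns the number of polls made. The predicate
    // receives the attempt count and stands in for the real mq permission check.
    static int pollUntilGranted(long periodMillis, IntPredicate granted)
            throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(1);
        int[] attempts = {0};
        timer.scheduleAtFixedRate(() -> {
            attempts[0]++;
            if (granted.test(attempts[0])) {
                done.countDown();   // permission obtained: the Ftp upload would start here
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        done.await();
        timer.shutdownNow();
        return attempts[0];
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulate permission being granted on the third poll.
        int attempts = pollUntilGranted(20, n -> n >= 3);
        System.out.println("granted after " + attempts + " polls");
    }
}
```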
The specific steps by which the 101 server processes data are as follows:
(i) start the jar and poll the 242 server every 2 minutes;
(ii) acquire the server file information through session.getStdout();
(iii) judge whether the file transmission is complete: if complete, pull the compressed file from the remote server through ch.ethz.ssh2;
(iv) delete the intermediate file on the 242 storage server through ssh and feed back to the national bureau;
(v) decrypt and decompress the file;
(vi) resolve the related parameters from the file name;
(vii) delete duplicate data from the GreenPlum database;
(viii) import the csv file into the GP database;
(ix) end.
Example 3:
The portable high-throughput big data acquisition system of the invention comprises a server end and a consumption end. The server end is used to generate different parameter message packets according to different service requirements, send them through a timed task or by manual operation, and decrypt and decompress the data collection compressed packet to obtain the decompressed csv file. The consumption end is used to parse the message packet after receiving it from the server end, obtain the corresponding parameters, perform data collection according to the parameters through a dtp transmission channel, write the collected data into a csv file, compress and encrypt the csv file to produce a data collection compressed packet, and upload the packet to the server end via FTP or SFTP (the two transmission modes are provided mainly for system compatibility).
Example 4: take an information center, a business company and an industrial company as examples.
As shown in FIG. 2, the information center includes an industry supervision platform, a marketing big data platform and a GreenPlum database. The business company includes a marketing platform and a DB2/Oracle database. The industrial company includes an analysis command platform and a GreenPlum database.
The data transmission steps from the business company to the information center are as follows:
(1) the DB2/Oracle database is polled;
(2) Logstash reads the configuration file and collects the data to obtain data file-1;
(3) data file-1 is encrypted and compressed by the pushing module and then pushed into the transmission channel;
(4) the information center receives the data transmitted through the transmission channel;
(5) the reading module decrypts and decompresses the data back into data file-1;
(6) data file-1 is stored into the GreenPlum database.
The data transmission steps from the information center to the industrial company are as follows:
(1) the GreenPlum database of the information center is polled;
(2) Logstash reads the configuration file and collects the data to be sent, obtaining data file-2;
(3) data file-2 is encrypted and compressed by the pushing module and then pushed into the transmission channel;
(4) the industrial company receives the data transmitted through the transmission channel;
(5) the reading module decrypts and decompresses the data back into data file-2;
(6) data file-2 is stored into the GreenPlum database of the industrial company.
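The polling and collection performed by Logstash in the flows above is driven by a configuration file. The following is a hypothetical sketch of such a pipeline using the standard jdbc input and kafka output plugins (the overall scheme forwards rows through Kafka); every driver path, host, credential, table and topic name is a placeholder, not a value from the patent.

```conf
# Hypothetical Logstash pipeline: poll a DB2/Oracle table and publish to Kafka.
input {
  jdbc {
    jdbc_driver_library => "/opt/drivers/ojdbc8.jar"       # placeholder driver jar
    jdbc_driver_class => "Java::oracle.jdbc.OracleDriver"  # placeholder class name
    jdbc_connection_string => "jdbc:oracle:thin:@db-host:1521/ORCL"
    jdbc_user => "collector"
    jdbc_password => "secret"
    schedule => "*/5 * * * *"                              # poll every five minutes
    statement => "SELECT * FROM t_order WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => "id"                                # incremental extraction
  }
}
output {
  kafka {
    bootstrap_servers => "kafka1:9092,kafka2:9092"
    topic_id => "collect-t_order"
    codec => json
  }
}
```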
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A portable high-throughput big data acquisition method, characterized in that a central server sends an instruction, each cluster server starts Logstash, configuration parameters are read through the etcd component of datatrains, the corresponding configuration file is automatically generated, Logstash reads the configuration file and collects data from various databases accordingly, the database data are sent to each server in message form through Kafka after sorting, and Kafka and Logstash cooperate at the consumption end to temporarily store the collected data in the corresponding format; the central server calls the related components to process the collected data; the method comprises the following specific steps:
S1, the server end generates different parameter message packets according to different service requirements and sends them through a timed task or by manual operation;
S2, after receiving the message packet sent by the server end, the consumption end parses it to obtain the corresponding parameters; data collection is carried out according to the parameters and completed through a dtp transmission channel; data extraction in the data collection process is completed by paged polling with select queries; each table in the database is polled by multiple threads during data collection; when data are extracted, the query for each table in the database is decomposed into sql statements according to the index hit rules; the multithreaded selection method is specified as follows:
(1) transmission control parameters: the thread number and the page size control the stability of the system, and the thread number is not less than 2;
(2) number of threads to open: the thread number determines the system resources occupied by data extraction, the system resources comprising the number of db connections and system io resources; the range of the thread number is: 1 < thread_num < 10;
(3) query page size: the page size determines the heap space and io resources occupied; the range of the page size is: 10000 < page < 100000;
(4) operation and maintenance personnel actively adjust the performance parameters of the corresponding server as required;
S3, the collected data are written into a csv file, and the csv file is compressed and encrypted to produce a data collection compressed packet;
S4, the consumption end uploads the data collection compressed packet to the server end via FTP or SFTP;
S5, the server end decrypts and decompresses the data collection compressed packet to obtain the decompressed csv file;
S6, judging whether the decompressed csv file belongs to a common table or a large table:
(1) if it is a common table, normal data insertion and deletion are carried out;
(2) if it is a large table, gpload is used to insert the data.
2. The portable high-throughput big data collection method according to claim 1, characterized in that gpload is used for data insertion in step S6; the specific method is as follows:
(1) Java controls shell instructions on the local Linux server; the java.lang.Runtime class encapsulates the runtime environment;
(2) each Java application has a single Runtime instance, which connects the application with its running environment;
(3) a reference to the current Runtime object is obtained through the Runtime.getRuntime() method;
(4) with the reference to the current Runtime object, the methods of the Runtime object are called to control the state and behavior of the Java virtual machine.
3. The portable high-throughput big data collection method according to claim 1, characterized in that the central server calls the related components to process the collected data; the specific steps are as follows:
(i) start the jar and poll the central server every 2 minutes;
(ii) acquire the server file information through session.getStdout();
(iii) judge whether the file transmission is complete: if complete, pull the compressed file from the remote server through ch.ethz.ssh2;
(iv) after the transmission is finished, delete the intermediate file through ssh;
(v) decrypt and decompress the file;
(vi) resolve the related parameters from the file name;
(vii) delete duplicate data from the GreenPlum database;
(viii) import the csv file into the GP database;
(ix) end.
4. A portable high-throughput big data acquisition system is characterized by comprising a server side and a consumption side;
the server side is used for generating different parameter message packets according to different service requirements, sending the parameter message packets through a timing task or manual operation, and carrying out data decryption and decompression on the data collection compression packets to obtain decompressed csv files; and simultaneously judging whether the decompressed csv file is a common table or a large table: firstly, if the table is a common table, normal data insertion and deletion are carried out; secondly, if the table is large, inserting data by adopting gpload;
the consumption end is used for analyzing after receiving the message packet sent by the server end, acquiring corresponding parameters, then performing data collection according to the parameters, completing data collection through a dtp transmission channel, writing the collected data into a csv file, performing compression and encryption operations on the csv file, producing a data collection compression packet and uploading the data collection compression packet to the server end in an FTP or SFTP mode; wherein, the data extraction in the data collection process is completed by select query paging polling; each table in the multithreading polling database is adopted in the data collection process; when data are extracted in the data collection process, disassembling sql of each table in the database according to the hit rule of the index; the multithreading selection method specifically comprises the following steps:
(1) transmission control parameters: the thread count and the paging size control the stability of the system, and the thread count is not less than 2;
(2) number of threads opened: the thread count determines the system resources occupied by data extraction, including the number of db connections and system io resources; the range of the thread count is: 1 < thread_num < 10;
(3) query page size: the page size determines the heap memory occupied and the amount of io resources consumed; the range of the page size is: 10000 < page < 100000;
(4) operation and maintenance personnel actively adjust the performance parameters of the corresponding server according to requirements.
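The parameter ranges in steps (1)–(3) can be enforced with a small validation check before collection starts. This is a sketch of the stated constraints only; the function name is illustrative.

```python
def validate_transfer_params(thread_num, page_size):
    """Check the transfer-control parameters against the stated ranges.

    Per the text: 1 < thread_num < 10 (so at least 2 threads are opened)
    and 10000 < page_size < 100000.
    """
    if not (1 < thread_num < 10):
        raise ValueError("thread_num must satisfy 1 < thread_num < 10")
    if not (10000 < page_size < 100000):
        raise ValueError("page_size must satisfy 10000 < page_size < 100000")
    return True
```

Operation and maintenance staff can then tune `thread_num` and `page_size` within these bounds to trade extraction throughput against db connection and io pressure.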
CN201811622024.5A 2018-12-28 2018-12-28 Portable high-throughput big data acquisition method and system Active CN109739818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811622024.5A CN109739818B (en) 2018-12-28 2018-12-28 Portable high-throughput big data acquisition method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811622024.5A CN109739818B (en) 2018-12-28 2018-12-28 Portable high-throughput big data acquisition method and system

Publications (2)

Publication Number Publication Date
CN109739818A CN109739818A (en) 2019-05-10
CN109739818B true CN109739818B (en) 2021-04-02

Family

ID=66361770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811622024.5A Active CN109739818B (en) 2018-12-28 2018-12-28 Portable high-throughput big data acquisition method and system

Country Status (1)

Country Link
CN (1) CN109739818B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110636116B (en) * 2019-08-29 2022-05-10 武汉烽火众智数字技术有限责任公司 Multidimensional data acquisition system and method
CN111241184B (en) * 2020-02-17 2023-07-04 湖南工学院 High throughput rate data processing method for multi-source database of power distribution network
CN111400367B (en) * 2020-02-28 2023-12-29 金蝶蝶金云计算有限公司 Service report generation method, device, computer equipment and storage medium
CN111526176A (en) * 2020-03-26 2020-08-11 青岛奥利普自动化控制系统有限公司 Data acquisition method and system for Claus Ma Fei injection molding machine
CN111866137B (en) * 2020-07-20 2022-08-23 北京百度网讯科技有限公司 Data acquisition dynamic control method and device, electronic equipment and medium
CN112612830B (en) * 2020-12-03 2023-01-31 海光信息技术股份有限公司 Method and system for exporting compressed data in batches and electronic equipment
CN112527836B (en) * 2020-12-08 2022-12-30 航天科技控股集团股份有限公司 Big data query method based on T-BOX platform
CN113377726A (en) * 2021-06-02 2021-09-10 浪潮软件股份有限公司 High-reliability distributed mass data transmission method and tool
CN115883545B (en) * 2023-02-15 2023-05-30 江西飞尚科技有限公司 High-frequency data transmission method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106330963A (en) * 2016-10-11 2017-01-11 江苏电力信息技术有限公司 Cross-network multi-node log collecting method
CN108133017A (en) * 2017-12-21 2018-06-08 广州市申迪计算机系统有限公司 A kind of multi-data source acquisition configuration method and device
CN108365985A (en) * 2018-02-07 2018-08-03 深圳壹账通智能科技有限公司 A kind of cluster management method, device, terminal device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Java invoking batch or executable files, and a Java process manager implemented with the Runtime and Process classes; ljheee; https://www.cnblogs.com/JamesWang1993/archive/2018/09/03/8548763.html; 2016-07-29; page 2 *
kafka+etcd+es+kibana logging system; SnowXaviera; https://java.ctolib.com/cyhe-LogSystem.html; 2018-07-18; pages 1-7 *
Summary after reading "Understanding the Java Virtual Machine in Depth"; 王菜鸟 (Wang Cainiao); https://www.cnblogs.com/JamesWang1993/archive/2018/09/03/8548763.html; 2018-09-03; Section III, OutOfMemoryError exception *

Also Published As

Publication number Publication date
CN109739818A (en) 2019-05-10

Similar Documents

Publication Publication Date Title
CN109739818B (en) Portable high-throughput big data acquisition method and system
CN109542733B (en) High-reliability real-time log collection and visual retrieval method
CN111124679B (en) Multi-source heterogeneous mass data-oriented time-limited automatic processing method
US8078394B2 (en) Indexing large-scale GPS tracks
CN108985981B (en) Data processing system and method
CN102970158A (en) Log storage and processing method and log server
CN111125260A (en) Data synchronization method and system based on SQL Server
CN104778188A (en) Distributed device log collection method
CN104331435A (en) Low-influence high-efficiency mass data extraction method based on Hadoop big data platform
CN107145576B (en) Big data ETL scheduling system supporting visualization and process
CN109669975B (en) Industrial big data processing system and method
CN112347071A (en) Power distribution network cloud platform data fusion method and power distribution network cloud platform
CN113312376B (en) Method and terminal for real-time processing and analysis of Nginx logs
US20120323924A1 (en) Method and system for a multiple database repository
Murugesan et al. Audit log management in MongoDB
CN107346270B (en) Method and system for real-time computation based radix estimation
CN106919566A (en) A kind of query statistic method and system based on mass data
CN102594889B (en) Data-call-based data synchronization and analysis system
CN109684279B (en) Data processing method and system
CN109471892B (en) Database cluster data processing method and device, storage medium and terminal
Iuhasz et al. Monitoring of exascale data processing
US9250839B1 (en) Printing system for data handling having a primary server for storing active and passive data and a second server for storing normalized and analytical data
CN113886065A (en) Storage and calculation method for acquiring mass data based on NB-lot Internet of things list in distributed environment
CN110515955B (en) Data storage and query method and system, electronic equipment and storage medium
CN113407415A (en) Log management method and device of intelligent terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant