CN112214453B

CN112214453B - Large-scale industrial data compression storage method, system and medium

Info

Publication number: CN112214453B
Application number: CN202010961819.XA
Authority: CN
Inventors: 高响
Original assignee: Shanghai Weiyi Intelligent Manufacturing Technology Co ltd; Changzhou Weiyizhi Technology Co Ltd
Current assignee: Shanghai Weiyi Intelligent Manufacturing Technology Co ltd; Changzhou Weiyizhi Technology Co Ltd
Priority date: 2020-09-14
Filing date: 2020-09-14
Publication date: 2021-10-01
Anticipated expiration: 2040-09-14
Also published as: CN112214453A

Abstract

The invention provides a large-scale industrial data compression storage method, system and medium. The method comprises the following steps: step 1: configuring different data acquisition systems according to the types of the data sources, and extracting the data acquired by the data acquisition systems through interface operations; step 2: defining a conversion chain, and temporarily converting the extracted data of different types into the Avro format through a data cleaning plug-in; and step 3: compressing the Avro-format data by using a GPL protocol, wherein the compression format is snappy, creating a data set with Parquet as the storage format in the distributed file system, and storing the compressed data. The invention can define the conversion chain and the compression and storage format for any type of data, and greatly improves the data processing speed and the data compression ratio of the computing platform.

Description

Large-scale industrial data compression storage method, system and medium

Technical Field

The invention relates to the technical field of data compression and storage, in particular to a large-scale industrial data compression and storage method, system and medium.

Background

With the rapid development of new infrastructure, more and more traditional industrial enterprises are beginning to increase productivity by means of internet technology, with data being the most critical element. In traditional internet big data processing, data volumes keep growing, and many enterprises keep two backup copies of their data, which wastes disk space.

Patent document CN108304472A (application No. 201711455790.2) discloses a data compression storage method and a data compression storage apparatus, the data compression method including the steps of: a segmentation step, in which original data is segmented into a plurality of fields; and a compression step, based on different data contents, adopting different compression strategies to compress different fields and storing compressed data. According to the data compression storage method and the data compression storage device, different compression methods can be adopted in consideration of different data contents, the data compression efficiency can be effectively improved, and the data compression rate is obviously improved compared with the data compression tools such as the general GZIP and SNAPPY.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a large-scale industrial data compression storage method, a large-scale industrial data compression storage system and a large-scale industrial data compression storage medium.

The large-scale industrial data compression and storage method provided by the invention comprises the following steps:

step 1: configuring different data acquisition systems according to the types of the data sources, and extracting data acquired by the data acquisition systems through an interface operation;

step 2: defining a conversion chain, and temporarily converting the formats of the extracted different types of data into an Avro format through a data cleaning plug-in;

step 3: compressing the Avro-format data by using a GPL protocol, wherein the compression format is snappy, creating a data set with Parquet as the storage format in the distributed file system, and storing the compressed data.

Preferably, the step 1 comprises:

step 1.1: classifying the data sources according to data format and storage medium, wherein the data format comprises structured data and unstructured data, and the storage medium comprises Kafka and RabbitMQ;

step 1.2: selecting the corresponding data acquisition system through a software configuration management tool, wherein Kafka corresponds to a Kafka data source selector, and RabbitMQ corresponds to a RabbitMQ data source selector.

Preferably, converting the data into the Avro format in step 2 comprises: mapping the industrial data to an Avro-format database object set (the Avro schema) and generating temporary Avro-format data.

Preferably, mapping the industrial data to the Avro-format database object set comprises the following steps:

step 2.1: defining a conversion chain by configuring the fields to be output and the input fields;

step 2.2: configuring an interceptor component of the data acquisition system to intercept the data, preloading the Avro-format database object set during data conversion, and injecting the database object set into the event header.

Preferably, generating the temporary Avro-format data from the industrial data comprises the following steps:

step 2.3: the data acquisition system receives industrial equipment log events and sends them to its data export component, which converts each event into a record and passes it to readLine; readLine extracts the log lines from the data pipeline, applies regular-expression matching, and emits one record per line of the input stream, with each line placed as a string in the message output field;

step 2.4: configuring a Flume interceptor to intercept the data carrying the Avro-format database object set, and converting the data into temporary Avro-format data according to that database object set.

Preferably, the step 3 comprises:

step 3.1: generating a JSON file for the data set partition, wherein the partition is used for storing data, and the data are processed based on time-based queries and an enterprise ID;

step 3.2: defining a data set according to the uniform resource identifier and the database object set, the data management platform creating or specifying the data set through a create command, wherein the data set comprises the uniform resource locator of the data set, the specified database object set and the partition field JSON.

Preferably, the step of generating the data set partition policy JSON file includes:

step 3.1.1: specifying the partition fields and types;

step 3.1.2: specifying the storage path of the partition JSON;

step 3.1.3: submitting the command that generates the partition strategy JSON.

Preferably, the data set is identified by a uniform resource identifier;

the address and the storage mode of the stored data are obtained through the uniform resource identifier.

The large-scale industrial data compression storage system provided by the invention comprises:

module M1: configuring different data acquisition systems according to the types of the data sources, and extracting data acquired by the data acquisition systems through an interface operation;

module M2: defining a conversion chain, and temporarily converting the formats of the extracted different types of data into an Avro format through a data cleaning plug-in;

module M3: compressing the Avro-format data by using a GPL protocol, wherein the compression format is snappy, creating a data set with Parquet as the storage format in the distributed file system, and storing the compressed data.

According to the present invention, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the method as described above.

Compared with the prior art, the invention has the following beneficial effects:

1. the method adopts Flume as a data pipeline to connect each data source of the industrial data platform, and adopts Morphlines to reduce the time and effort required for constructing and changing the ETL stream processing application of the data; only business logic needs to be considered, configuration is carried out through configuration files, and the data can be extracted, converted and loaded into a distributed storage system such as HDFS (Hadoop distributed file system) without writing complex code;

2. by adopting a DataSet, the problem that JSON data cannot be directly converted into the Parquet format when stored in HDFS is solved; when the data set is created, the DataSet specifies a columnar storage format and the snappy compression format; the size compression ratio of snappy-compressed data reaches 30%-40%, and the compression and decompression rates reach 180 MB/s and 430 MB/s respectively, which greatly improves the efficiency of landing the data and the utilization rate of the disk;

3. the method docks message data, such as Kafka messages, of the industrial data platform through Flume; the data are processed and landed through Flume, stored in the Parquet format and compressed with snappy; only one copy of the data is stored, and the consistency of the data is guaranteed by Flume; when the data are landed, Flume can perform a rollback operation through its own transaction mechanism; no code needs to be written, so that the working time of developers is greatly reduced, the working efficiency is improved, and the resource utilization rate is increased.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flow chart of the method of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.

Example:

Referring to FIG. 1, the large-scale industrial data compression and storage method provided by the invention comprises:

industrial data extraction step: configuring different FlumeSources according to different data sources, and exposing the FlumeSource configuration through an interface so that the settings are configurable and general-purpose;

temporary preloading of data as Avro step: defining a conversion chain, configuring the Schema of the Avro data format, and configuring Morphline to temporarily convert data of different formats into Avro-format data;

create Dataset step: creating a data set with Parquet as the storage format in HDFS through the Dataset, compressing the data by a GPL protocol, and declaring that the final landed data are in the Parquet format with snappy compression;

combined operation step: connecting and running the above steps through the Flume configuration, so that, after compression and preprocessing, a large amount of data in different formats is finally stored in the distributed storage system in the Parquet format.

The step of configuring the FlumeSource through a general-purpose interface comprises the following steps:

Step A1: the data stored by the industrial data processing platform are classified according to data format and storage medium; the data formats include structured data and unstructured data, and the storage media include Kafka and RabbitMQ.

Step A2: through a Flume configuration management tool, the corresponding FlumeSource is selected: Kafka corresponds to a Kafka data source selector, and RabbitMQ corresponds to a RabbitMQ data source selector.
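As a minimal sketch of the selection in steps A1 and A2, the following Python maps a storage medium to a Flume source type and emits Flume-style property lines. The KafkaSource class name is the one shipped with Apache Flume; the RabbitMQ class name, the agent name and the source name are placeholders assumed for illustration, since the text does not name the plug-in used.

```python
# Mapping from storage medium to a Flume source class (illustrative only).
SOURCE_SELECTORS = {
    # Source class shipped with Apache Flume.
    "kafka": "org.apache.flume.source.kafka.KafkaSource",
    # Placeholder: stands in for whatever third-party RabbitMQ source is installed.
    "rabbitmq": "com.example.flume.RabbitMQSource",
}


def select_source(medium, agent="agent", name="src1"):
    """Return Flume-style properties for the source matching the storage medium."""
    key = medium.strip().lower()
    if key not in SOURCE_SELECTORS:
        raise ValueError(f"no data source selector configured for {medium!r}")
    return {
        f"{agent}.sources": name,
        f"{agent}.sources.{name}.type": SOURCE_SELECTORS[key],
    }


def render_properties(props):
    """Render a property dict as the lines of a .properties file."""
    return "\n".join(f"{key} = {value}" for key, value in props.items())


if __name__ == "__main__":
    print(render_properties(select_source("Kafka")))
```

Keeping the mapping in data rather than in code mirrors the configuration-driven approach of the method: adding a new medium means adding one entry, not writing a new collector.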

The step of temporarily preloading data as Avro comprises the following steps: mapping the industrial data to the Schema of the Avro data and generating temporary Avro-format data.

The step of mapping the industrial data to the Schema of the Avro data comprises the following steps:

Step B1: a conversion chain is defined by configuring the fields to be output and the input fields; the conversion chain can consume any type of data from any type of data source.

Step B2: a Flume interceptor is configured to intercept the data before they flow to the next step, preload the Avro Schema when the extracted data are converted, and inject the schema into the event header so that the AvroEventSerializer can pick it up.
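The header injection of step B2 can be sketched in Python as follows. This is an illustration under stated assumptions, not Flume's Java interceptor API: the event is modelled as a plain dict, the schema and its field names are hypothetical, and the header key is the one Flume's AvroEventSerializer reads for an inline schema.

```python
import json

# Header key that Flume's AvroEventSerializer reads for an inline schema.
SCHEMA_HEADER = "flume.avro.schema.literal"

# Hypothetical Avro schema for an industrial device log record.
DEVICE_LOG_SCHEMA = {
    "type": "record",
    "name": "DeviceLog",
    "fields": [
        {"name": "enterprise_id", "type": "string"},
        {"name": "date_time", "type": "long"},
        {"name": "message", "type": "string"},
    ],
}


def intercept(event, schema=DEVICE_LOG_SCHEMA):
    """Return a copy of the event with the Avro schema injected into its headers."""
    headers = dict(event.get("headers", {}))
    headers[SCHEMA_HEADER] = json.dumps(schema, sort_keys=True)
    return {"headers": headers, "body": event.get("body", b"")}


if __name__ == "__main__":
    out = intercept({"headers": {"host": "press-01"}, "body": b"spindle ok"})
    print(sorted(out["headers"]))
```

The interceptor copies the headers instead of mutating them, so the original event stays unchanged if a later stage fails.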

The step of generating temporary Avro-format data from the industrial data comprises:

Step C1: Flume receives the industrial device log events and sends them to the Flume MorphlineSink, which converts each Flume event into a record and passes it to the readLine command through the pipeline. The readLine command extracts the log lines from the data pipeline, applies regular-expression pattern matching, and emits one record per line of the input stream; each line is placed as a string in the message output field.

Step C2: a Flume interceptor is configured to intercept the data after step B2, convert the structured or unstructured data into temporary Avro-format data by matching them against the Avro Schema, and pass the temporary Avro-format data into a FileChannel for further processing.
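To make the schema matching in step C2 concrete, the sketch below projects a parsed record onto the fields of an Avro-style schema, coercing primitive types and applying defaults. It is a simplified stand-in that uses only the Python standard library and does not produce Avro binary data; the schema and field names are hypothetical.

```python
# Python constructors for a few Avro primitive types (illustrative subset).
AVRO_TO_PYTHON = {"string": str, "int": int, "long": int, "float": float, "double": float}

# Hypothetical schema; 'level' carries a default so it may be absent from the input.
DEVICE_LOG_SCHEMA = {
    "type": "record",
    "name": "DeviceLog",
    "fields": [
        {"name": "enterprise_id", "type": "string"},
        {"name": "date_time", "type": "long"},
        {"name": "level", "type": "string", "default": "INFO"},
        {"name": "message", "type": "string"},
    ],
}


def to_avro_record(record, schema=DEVICE_LOG_SCHEMA):
    """Keep only the schema's fields, coerce their types and fill in defaults."""
    out = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = AVRO_TO_PYTHON[field["type"]](record[name])
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise KeyError(f"missing required field {name!r}")
    return out


if __name__ == "__main__":
    raw = {"enterprise_id": 42, "date_time": "1594713600000", "message": "spindle ok", "extra": 1}
    print(to_avro_record(raw))
```

Fields that are not in the schema are dropped, which is what lets one conversion chain accept differently shaped inputs and still hand uniform records to the channel.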

The creating Dataset step includes:

Step D1: a dataset partition JSON file is generated. A dataset is a collection of records, similar to a relational database table. The records are similar to table rows, but the columns may contain not only strings or numbers but also nested data structures, such as lists, maps and other records. The create command mainly relies on the dataset's partition strategy; a strategy such as date_time:year, date_time:month, date_time:day can be defined to partition the date_time field by year, month and day. The partitions define the logical partitioning of the stored data. Time-based queries are the most common way of processing the data. When using data after 7/14/2020, Hadoop only needs to access the data/year-2020/month-7/day-14 stored in the partition. By using partitions that correspond to the most common queries, the application can run faster, which increases data computation efficiency and resource utilization.
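A partition strategy of this kind, and the directory it implies for a given timestamp, can be sketched in Python. The JSON layout (a list of entries with type, source and name) is modelled loosely on the strategy files used by dataset tooling and is an assumption here, as are the function names; the `key=value` directory naming is also an assumption, and the separator is a parameter because the text writes the path with hyphens.

```python
import json
from datetime import datetime


def partition_strategy(source="date_time"):
    """Build a year/month/day partition strategy over one timestamp field."""
    return [{"type": unit, "source": source, "name": unit} for unit in ("year", "month", "day")]


def partition_path(ts, root="data", sep="="):
    """Return the partition directory that holds records with timestamp ts."""
    return f"{root}/year{sep}{ts.year}/month{sep}{ts.month}/day{sep}{ts.day}"


if __name__ == "__main__":
    print(json.dumps(partition_strategy(), indent=2))
    print(partition_path(datetime(2020, 7, 14)))
    print(partition_path(datetime(2020, 7, 14), sep="-"))
```

A query restricted to one day then only has to read one directory, which is the pruning effect the paragraph above describes.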

Step D2: to create a data set, at least the URI and the schema are required to define it. The data management platform creates or specifies a data set through a create command, which mainly comprises the URL of the data set, the specified schema and the partition field JSON; the data are stored in the Parquet format, the schema is the one defined in step B2, and the partition JSON is the one generated in step D1. The data set is identified by the URI, which tells how and where to store the data. If a Dataset is created using the URI HDFS:/user/2020/7/14/, the data are finally stored in the /user/2020/7/14/ directory of HDFS. The created data set finally generates a metadata folder in HDFS that holds a schema and a descriptor; the descriptor file contains the snappy compression format, the Parquet data format, the data storage path and the partition fields.
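How a URI can carry both the storage mode and the location, and what a descriptor recording the format choices might hold, is sketched below in Python. The descriptor keys and function names are hypothetical and do not reproduce the metadata files written to HDFS; the optional `dataset:` prefix is accepted as an assumption about how such URIs may be written.

```python
def parse_dataset_uri(uri):
    """Split a URI such as 'HDFS:/user/2020/7/14/' into storage mode and path."""
    rest = uri.strip()
    if rest.lower().startswith("dataset:"):
        rest = rest[len("dataset:"):]
    scheme, sep, path = rest.partition(":")
    if not sep or not path.startswith("/"):
        raise ValueError(f"not a storage URI: {uri!r}")
    return {"storage": scheme.lower(), "path": path}


def build_descriptor(uri, partition_fields):
    """Assemble a descriptor with the format, compression, location and partition fields."""
    location = parse_dataset_uri(uri)
    return {
        "format": "parquet",
        "compression": "snappy",
        "storage": location["storage"],
        "location": location["path"],
        "partition_fields": list(partition_fields),
    }


if __name__ == "__main__":
    print(build_descriptor("HDFS:/user/2020/7/14/", ["year", "month", "day"]))
```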

The invention realizes the following functions:

1) by writing configuration files to define the data conversion process, the problem of having to write a large amount of code and then deploy, operate and maintain it is solved;

2) by creating the data set for storing the data in advance and temporarily converting the data into the Avro format, the need to rely on Spark for processing the data is removed, which saves computing resources and shortens the data processing flow;

3) by presetting the dataset partition fields and partitioning automatically according to the field contents of the data, the problems of isolating the stored data of different enterprises and of the efficiency of subsequent data analysis and computation are solved.

The invention performs data access, data flow and data storage through a configuration interface, and achieves extremely high compression and storage efficiency, greatly improving the utilization rate of storage and computing resources. Mainstream techniques for storing data in the Parquet format mostly use Spark for data processing and storage, which requires additional computing resources and data processing components, requires separate code development and deployment for different industrial data, and is cumbersome to maintain and develop. In subsequent analysis and computation on data in this format, the computation time for the same data sample improves by a factor of about 5.

Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A large-scale industrial data compression storage method is characterized by comprising the following steps:

and step 3: compressing data in an Avro format by a GPL protocol, wherein the compression format is snappy, creating a data set in a distributed file system, the data set taking Parquet as its storage format, and storing the compressed data;

the method adopts Flume as a data pipeline to connect each data source of an industrial data platform, and adopts Morphlines to reduce the time required for constructing and changing the ETL (extract, transform, load) stream processing application of the data; only business logic needs to be considered, configuration is carried out through a configuration file, and the data are then extracted, converted and loaded into a distributed storage system such as HDFS (Hadoop distributed file system);

by adopting the DataSet data set, the DataSet specifies the data format as a columnar storage format and a snappy compression format when the data set is created;

docking data of a Kafka message middleware of an industrial data platform through Flume, processing and landing the data only through Flume, storing the data in the Parquet format and compressing the data with snappy, storing only one copy of the data, ensuring the consistency of the data through the Flume FileChannel, and, when the data are landed, performing a rollback operation by Flume through its own transaction mechanism, without writing code;

the step 1 comprises the following steps:

step 1.2: selecting a corresponding data acquisition system through a software configuration management tool, wherein Kafka corresponds to a Kafka data source selector, and Rabbitmq corresponds to a Rabbitmq data source selector;

said step 2 converts the data into an Avro format, comprising: mapping the industrial data to an Avro-formatted database object set and generating temporary Avro-formatted data;

the industrial data mapping Avro format database object set comprises the following steps:

step 2.2: configuring an interceptor component of the data acquisition system, intercepting data, pre-loading a database object set in an Avro format during data conversion, and injecting the database object set into a header file;

the industrial data generation method generates data in a temporary Avro format, and comprises the following steps:

2. The large-scale industrial data compression storage method according to claim 1, wherein the step 3 comprises:

3. The large-scale industrial data compression storage method according to claim 2, wherein the step of generating a data set partition strategy JSON file comprises:

step 3.1.1: specifying partition fields and types;

step 3.1.2: a partition JSON storage path is designated;

4. The large-scale industrial data compression storage method according to claim 2, wherein the data set is identified by a uniform resource identifier;

5. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 4.