CN110515893B

CN110515893B - Data storage method, device, equipment and computer readable storage medium

Info

Publication number: CN110515893B
Application number: CN201910684136.1A
Authority: CN
Inventors: 潘利杰
Original assignee: Jinan Inspur Data Technology Co Ltd
Current assignee: Jinan Inspur Data Technology Co Ltd
Priority date: 2019-07-26
Filing date: 2019-07-26
Publication date: 2022-12-09
Anticipated expiration: 2039-07-26
Also published as: CN110515893A

Abstract

The embodiment of the invention discloses a data storage method, apparatus, device and computer-readable storage medium. The method comprises: creating a hive table for reading a Protobuf serialized file stored in the bottom layer; generating a corresponding programming language file from the description file of the Protobuf serialized data, and adding it to a file package to be loaded; configuring, in the hive table, the fields used for parsing data according to the parsing mode corresponding to a pre-constructed data parser; and, based on the data parser, automatically parsing and reading the Protobuf serialized data by using the hive table. The data parser is used for matching the hive table with the configuration file in terms of table schema and configuration structure, and for generating an Object set of the hive table structure and a hive result object set. The method solves the problem of parsing Protobuf files stored at the bottom layer of the hive data warehouse with the least development effort, and helps improve the data storage security of the hive data warehouse.

Description

Data storage method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of storage technologies, and in particular, to a data storage method, apparatus, device, and computer-readable storage medium.

Background

With the rapid development of big data and cloud computing, large amounts of data continuously emerge in today's information society. In particular, the data volume of companies in industries such as the internet and telecommunications keeps growing at a remarkable speed, which places higher demands on storage. At present, the industry almost always stores big data in compressed form to reduce the space occupied by files, but reducing storage space requires more computation. How to strike a balance between computing and storage, so that resources are used in the balanced way the industry requires, is a problem that deserves the attention of those skilled in the art.

Because the hive data warehouse has many advantages, such as expandable computing power, higher data fault tolerance, data security, all advantages of the integrated HDFS, low cost, simplicity and easy use, the hive data warehouse has been widely applied to application scenarios with the demand of the offline data warehouse as a main data warehouse.

However, the hive data warehouse also has drawbacks: the expression capability of its HSQL is limited, the generated mapreduce jobs are not intelligent enough, the tuning granularity is coarse, and so on. The current hive data warehouse supports the data storage formats textfile, sequencefile, rcfile, orcfile and parquet, but none of these storage formats protects the data: an illegal intruder or unauthorized user who obtains part of the data in one of these formats can view all of the content of the data obtained, so the data is easily leaked, which is not conducive to secure storage of user data.

Disclosure of Invention

The embodiment of the disclosure provides a data storage method, a data storage device, data storage equipment and a computer-readable storage medium, which solve the problem of analysis of bottom-layer Protobuf file storage of a hive data warehouse with the least development amount and are beneficial to improving the data storage safety of the hive data warehouse.

In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:

an embodiment of the present invention provides a data storage method, including:

creating a hive table, wherein the hive table is used for reading data in a Protobuf structure data storage format stored in a bottom layer;

generating a corresponding programming language file from the description file of the Protobuf serialized data, and sending the corresponding programming language file to a file package to be loaded;

configuring fields for parsing data in the hive table according to a parsing mode corresponding to a pre-constructed data parser;

automatically parsing and reading the Protobuf serialized data by using the hive table based on the data parser;

the data parser is used for matching the hive table with the configuration file in terms of table schema and configuration structure, and for generating an Object set of the hive table structure and a hive result object set.

Optionally, when applied to a Java language environment, the construction process of the data parser includes:

setting a pattern matching file for reading a configuration file and associating the hive table structure so as to match the hive table with the configuration file in a table pattern and configuration structure;

setting an Object conversion file for implementing conversion logic from a Java object to an Object object based on a subclass proxy;

and setting a nested traversal file for generating Java objects by traversing the Protobuf serialized data by using the nested objects.

Optionally, the automatically parsing and reading the Protobuf serialization data by using the hive table based on the data parser includes:

generating an Object set of the hive table structure by rewriting initialization functions of a hive data warehouse to assemble the pattern matching file and the Object conversion file;

generating a hive result object set by assembling the nested traversal files by rewriting data analysis functions of a hive data warehouse;

generating an executable file package, and setting the position of a java file generated by a Protobuf structure definition file in the configuration file so as to load the executable file package into a hive environment variable and use the hive table to specify an entry for analyzing and reading the Protobuf serialized data;

and executing a hiveSQL query statement to read the Protobuf serialized data.

Optionally, the method further includes:

judging whether information indicating that the data was parsed and read successfully is received within a preset time;

and if not, raising a data reading error alarm, and feeding back log file information of the process of automatically parsing and reading the Protobuf serialized data.

Another aspect of embodiments of the present invention provides a data storage device, including:

the Hive table creating module is used for creating a Hive table, and the Hive table is used for reading data in a Protobuf structure data storage format stored in a bottom layer;

the data conversion module is used for generating a corresponding programming language file from the description file of the Protobuf serialized data and sending the corresponding programming language file to a file package to be loaded;

the analysis mode configuration module is used for configuring fields used for analyzing data in the hive table according to an analysis mode corresponding to a preset constructed data analyzer; the data parser is used for matching the hive table with a configuration file through a table mode and a configuration structure, and generating an Object set and a hive result Object set of the hive table structure;

and the data automatic analysis reading module is used for automatically analyzing and reading the Protobuf serialized data by utilizing the hive table based on the data analyzer.

Optionally, the data parser comprises a pattern matcher, an object converter and a nested traversing device;

the pattern matcher is used for matching the hive table with a configuration file to obtain a table pattern and a configuration structure;

the object converter is used for implementing conversion logic from a Java object to an Object object based on a subclass proxy;

the nested traversing device is used for traversing the Protobuf serialized data by using a nested object to generate a Java object.

Optionally, the data automatic analysis reading module includes:

the automatic Object generation sub-module is used for generating an Object set of the hive table structure by rewriting initialization functions of a hive data warehouse so as to assemble the pattern matching file and the Object conversion file;

the hive result object set generation submodule is used for generating a hive result object set by assembling the nested traversal files by rewriting the data analysis function of the hive data warehouse;

the parsing mode specifying submodule is used for generating an executable file package, setting the position of the java file generated from the Protobuf structure definition file in the configuration file, and loading the executable file package into a hive environment variable so as to use the hive table to specify the entry for parsing and reading the Protobuf serialized data;

and the reading submodule is used for executing a hiveSQL query statement to read the Protobuf serialized data.

Optionally, the system further comprises an alarm module, configured to alarm a data reading error if the information that the data analysis and reading are successful is not received within a preset time, and feed back log file information of the process of automatically analyzing and reading the Protobuf serialized data.

An embodiment of the present invention further provides a data storage device, which includes a processor, and the processor is configured to implement the steps of the data storage method according to any one of the foregoing items when executing the computer program stored in the memory.

Finally, an embodiment of the present invention provides a computer-readable storage medium, in which a data storage program is stored, and the data storage program, when executed by a processor, implements the steps of the data storage method according to any one of the foregoing items.

The technical solution provided by the application has the following advantages. Based on the pre-constructed data parser, the hive data warehouse can automatically parse and read the Protobuf serialized files stored at the bottom layer simply by configuring the data parsing fields of the hive table. Because data stored in the Protobuf structured data storage format is compressed and highly secure, the data storage security of the hive data warehouse is improved. In addition, only some parsing fields need to be modified and the original program code does not need to be changed, so the problem of parsing Protobuf files stored at the bottom layer of the hive data warehouse is solved with the least development effort. When the structure of the data transmitted by the upper layer changes, the lower-layer program can adapt to the change merely by changing the configuration file, which greatly reduces the risk an enterprise incurs by changing application-layer programs in response to data structure changes.

In addition, the embodiment of the invention also provides a corresponding implementation device, equipment and a computer readable storage medium for the data storage method, so that the method has higher practicability, and the device, the equipment and the computer readable storage medium have corresponding advantages.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the related art, the drawings required to be used in the description of the embodiments or the related art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a schematic flowchart of a data storage method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another data storage method according to an embodiment of the present invention;

Fig. 3 is a block diagram of a data storage device according to an embodiment of the present invention;

Fig. 4 is a block diagram of another embodiment of a data storage device according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed.

Having described the technical solutions of the embodiments of the present invention, various non-limiting embodiments of the present application are described in detail below.

Referring to fig. 1, fig. 1 is a schematic flow chart of a data storage method according to an embodiment of the present invention, where the embodiment of the present invention includes the following contents:

S101: Create a hive table, where the hive table is used for reading data in the Protobuf structured data storage format stored in the bottom layer.

In the present application, any prior art may be used to create the hive table, and for a specific creation process, reference is made to the description of the corresponding technology, which is not described herein again. The hive table created in S101 is the same as the hive table in the prior art.

It can be understood that the Protobuf structured data storage format offers both security and data compression. It is a language-independent, platform-independent and extensible method of serializing structured data, and a lightweight, efficient format that can be used to serialize structured data. Storing data in the Protobuf structured data storage format in the hive data warehouse can therefore improve the security of the stored data. It should be noted that the purpose of creating the hive table in this step is to specify the parsing mode and the parsing entry for data in the Protobuf structured data storage format.
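As a rough illustration of how a hive table can name its parsing entry, the sketch below assembles a CREATE EXTERNAL TABLE statement whose ROW FORMAT SERDE clause points at a parser class. The class name com.example.ProtobufSerDe, the property key proto.class, the table name and the storage path are all hypothetical; the patent does not give them. A real table would probably also need a STORED AS INPUTFORMAT ... OUTPUTFORMAT ... clause suited to the binary files, which is omitted here.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class ProtobufTableDdl {

    /** Builds a CREATE EXTERNAL TABLE statement whose SERDE clause names the parser class. */
    static String build(String table, Map<String, String> columns, String serdeClass,
                        Map<String, String> serdeProps, String location) {
        StringJoiner cols = new StringJoiner(", ");
        columns.forEach((name, type) -> cols.add(name + " " + type));

        StringJoiner props = new StringJoiner(", ");
        serdeProps.forEach((key, value) -> props.add("'" + key + "'='" + value + "'"));

        return "CREATE EXTERNAL TABLE " + table + " (" + cols + ")\n"
             + "ROW FORMAT SERDE '" + serdeClass + "'\n"
             + "WITH SERDEPROPERTIES (" + props + ")\n"
             + "LOCATION '" + location + "'";
    }

    /** Example call; every name and path below is a hypothetical placeholder. */
    static String example() {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("id", "BIGINT");
        columns.put("name", "STRING");

        Map<String, String> serdeProps = new LinkedHashMap<>();
        serdeProps.put("proto.class", "com.example.PersonProto$Person");

        return build("person", columns, "com.example.ProtobufSerDe", serdeProps, "/data/person");
    }
}
```

The table itself stays an ordinary hive table, as stated above; only the SERDE clause and its properties tell hive which parser to call.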

S102: Generate a corresponding programming language file from the description file of the Protobuf serialized data, and add it to the file package to be loaded.

In this embodiment, the format of the data stored in the bottom layer of the hive data warehouse is Protobuf, and the data in the storage format of the Protobuf structure data exists in a serialized form during storage or transmission, that is, in the embodiment of the present invention, the Protobuf serialized file stored in the bottom layer of the hive data warehouse is to be read.
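One common way to generate the programming language file is the protoc compiler. The sketch below only assembles the command line for Java code generation and does not run it; the directory and file names are hypothetical, and the patent does not say which tool or options are used.

```java
import java.util.ArrayList;
import java.util.List;

final class ProtocCommand {

    /** Returns the protoc argument list that generates Java sources for one .proto file. */
    static List<String> javaCodegen(String protoDir, String protoFile, String javaOutDir) {
        List<String> command = new ArrayList<>();
        command.add("protoc");
        command.add("--proto_path=" + protoDir);   // where imports are resolved
        command.add("--java_out=" + javaOutDir);   // where generated .java files are written
        command.add(protoDir + "/" + protoFile);   // the description file to compile
        return command;
    }
}
```

The returned list could be handed to a ProcessBuilder; the generated sources would then be compiled and packaged into the file package that hive loads.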

S103: Configure the fields for parsing data in the hive table according to the parsing mode corresponding to a pre-constructed data parser.

In the application, the data parser can be used for matching the hive table with the configuration file in terms of table schema and configuration structure, and for generating an Object set of the hive table structure and a hive result object set. The configuration file is a configuration file of the system; the Object set of the hive table structure is used for the subsequent automated operations; and the hive result object set is the data structure into which the Protobuf serialized data is converted so that it can be operated on as objects. With the data parser, Protobuf data can be parsed merely by configuring some field parsing classes in the hive table, so that the data stored in the bottom layer can be queried and accessed directly with hiveSQL. Fields can be added conveniently without modifying program code, and when the structure of the data transmitted by the upper layer changes, the lower-layer program can adapt merely by changing the configuration file, which greatly reduces the risk caused by changing application-layer programs in response to data structure changes.

S104: Automatically parse and read the Protobuf serialized data by using the hive table based on the data parser.

The data parser can be initialized through the initialization method of hive's built-in serialization and deserialization mechanism (SerDe), and the Protobuf serialized file can then be read and parsed by executing a hiveSQL query statement against the hive table created in S101.

In the technical solution provided by the embodiment of the invention, the pre-constructed data parser allows the hive data warehouse to automatically parse and read the Protobuf serialized files stored at the bottom layer simply by configuring the data parsing fields of the hive table. Because data stored in the Protobuf structured data storage format is highly secure and compressed, the data storage security of the hive data warehouse is improved. In addition, the original program code does not need to be modified, so the problem of parsing Protobuf files stored at the bottom layer of the hive data warehouse is solved with the least development effort. When the structure of the data transmitted by the upper layer changes, the lower-layer program can adapt merely by changing the configuration file, which greatly reduces the risk an enterprise incurs by changing application-layer programs in response to data structure changes.

As a preferred implementation, when applied to a Java language environment, the data parser may include a pattern matcher for pattern matching of configuration files and hive tables, an Object converter implemented using Java CGlib and including conversion logic of Java objects to Object objects, and a nested traversing device for traversing Java objects generated by Protobuf using nested objects. Accordingly, the construction process of the data parser may include:

It will be appreciated that a class inheriting the abstract class AbstractSerDe may first be defined to implement the initialize, deserialize and getObjectInspector methods. AbstractSerDe is an abstract class built into hive that declares the abstract methods a parser must implement; the interpreters shipped with hive also inherit this abstract class. initialize is an abstract method defined in AbstractSerDe that is mainly responsible for the initialization work of the interpreter, such as loading environment variables and table information. deserialize is an abstract method defined in AbstractSerDe that is mainly responsible for implementing the parsing of the serialized data file. getObjectInspector is an abstract method defined in AbstractSerDe that is mainly responsible for returning the object inspector through which hive reads the data of the parsed result object. The interpreters shipped with hive implement these methods as well.
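The three-method life cycle described above can be sketched without a hive dependency. SketchSerDe below is a simplified stand-in written for this illustration, not hive's AbstractSerDe: the real class has different signatures and more methods. The record decoding is also a placeholder (comma-separated text) where a real parser would decode Protobuf bytes. The table property key columns is the one hive uses for the comma-separated column names.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

/** Simplified stand-in for the initialize / deserialize / getObjectInspector contract. */
abstract class SketchSerDe {
    abstract void initialize(Properties tableProperties);
    abstract List<Object> deserialize(byte[] record);
    abstract List<String> getObjectInspector();
}

final class DelimitedSketchSerDe extends SketchSerDe {
    private List<String> columnNames = new ArrayList<>();

    /** Reads table information once, before any record is parsed. */
    @Override
    void initialize(Properties tableProperties) {
        String columns = tableProperties.getProperty("columns", "");
        columnNames = columns.isEmpty()
                ? new ArrayList<>()
                : new ArrayList<>(Arrays.asList(columns.split(",")));
    }

    /** Turns one stored record into a row with one value per table column. */
    @Override
    List<Object> deserialize(byte[] record) {
        String[] parts = new String(record, StandardCharsets.UTF_8).split(",", -1);
        List<Object> row = new ArrayList<>();
        for (int i = 0; i < columnNames.size(); i++) {
            row.add(i < parts.length ? parts[i] : null); // missing trailing fields become null
        }
        return row;
    }

    /** Describes the row structure; here simply the ordered column names. */
    @Override
    List<String> getObjectInspector() {
        return columnNames;
    }
}
```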

A pattern matching file is set for reading the configuration file and associating it with the hive table structure, so as to match the hive table with the configuration file in terms of table schema and configuration structure. In implementation, a class that reads the configuration file and associates it with the hive table structure is defined, and this class serves as the pattern matching file.
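A minimal sketch of such a matching class follows, assuming a properties-style configuration in which each hive column has an entry column.NAME whose value is the path of the corresponding Protobuf field. The key convention and the class name SchemaMatcher are assumptions made for this example; the patent does not specify the configuration format.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

final class SchemaMatcher {

    /** Maps each hive column to its configured field path; fails if a column has no entry. */
    static Map<String, String> match(List<String> hiveColumns, String configText) {
        Properties config = new Properties();
        try {
            config.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }

        Map<String, String> columnToPath = new LinkedHashMap<>();
        List<String> missing = new ArrayList<>();
        for (String column : hiveColumns) {
            String path = config.getProperty("column." + column);
            if (path == null) {
                missing.add(column);
            } else {
                columnToPath.put(column, path);
            }
        }
        if (!missing.isEmpty()) {
            throw new IllegalArgumentException("No config entry for columns: " + missing);
        }
        return columnToPath;
    }
}
```

With this arrangement, adding a column means adding one configuration line and one table column, with no change to the matching code.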

An Object conversion file is set for implementing the conversion logic from a Java object to an Object object based on a subclass proxy. In implementation, a class that uses Java CGlib to implement the conversion logic from a Java object to an Object object can be defined and used as the Object conversion file. Java CGlib is one implementation of Java dynamic proxying, also called subclass proxying, which extends the functions of a target object by constructing a subclass object in memory.
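The idea of a subclass proxy can be shown with a hand-written subclass. CGlib would generate an equivalent subclass at runtime; it is written out by hand here so that the sketch has no external dependency. PersonMessage and PersonMessageProxy are hypothetical names, and the access log is just an example of added behaviour, not the conversion logic of the patent.

```java
import java.util.ArrayList;
import java.util.List;

/** Target class whose behaviour is to be extended. */
class PersonMessage {
    private final String name;

    PersonMessage(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

/** Hand-written equivalent of the subclass a proxy library would build in memory. */
final class PersonMessageProxy extends PersonMessage {
    final List<String> accessLog = new ArrayList<>();

    PersonMessageProxy(String name) {
        super(name);
    }

    @Override
    String getName() {
        accessLog.add("getName");   // behaviour added by the proxy
        return super.getName();     // then delegate to the target's own logic
    }
}
```

Callers keep using the PersonMessage type; because the proxy is a subclass, no interface is required on the target class.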

A nested traversal file is set for generating Java objects by traversing the Protobuf serialized data with nested objects. In implementation, a class can be defined that serves as a nested traversal tool for traversing, through nested objects, the Java objects generated by Protobuf, and this class is used as the nested traversal file.
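The traversal of nested objects can be sketched as a recursive walk. In the sketch below a nested Map stands in for a Protobuf-generated Java object, which is an assumption made to keep the example dependency-free; the dotted path naming is likewise only illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NestedTraversal {

    /** Flattens a nested message into leaf values keyed by dotted field path. */
    static Map<String, Object> flatten(Map<String, Object> message) {
        Map<String, Object> out = new LinkedHashMap<>();
        walk("", message, out);
        return out;
    }

    private static void walk(String prefix, Map<String, Object> node, Map<String, Object> out) {
        for (Map.Entry<String, Object> entry : node.entrySet()) {
            String path = prefix.isEmpty() ? entry.getKey() : prefix + "." + entry.getKey();
            Object value = entry.getValue();
            if (value instanceof Map) {
                // Nested message: descend one level and keep extending the path.
                @SuppressWarnings("unchecked")
                Map<String, Object> child = (Map<String, Object>) value;
                walk(path, child, out);
            } else {
                out.put(path, value);
            }
        }
    }
}
```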

Based on the data parser constructed above, the step S104 of automatically parsing and reading the Protobuf serialized data by using the hive table may specifically include:

based on the parsing mode, generating the Object set of the hive table structure by overriding the initialization function of the hive data warehouse to assemble the pattern matching file and the Object conversion file; the initialization function here may be the initialize method of hive's own AbstractSerDe.

Based on the parsing mode, the hive result object set is generated by overriding the data parsing function of the hive data warehouse to assemble the nested traversal file. The data parsing function here may be the deserialize method of hive's own AbstractSerDe.

And generating an executable file package, and setting the position of a java file generated by the Protobuf structure definition file in a configuration file so as to load the executable file package into the hive environment variable and use the hive table to specify an entry for analyzing and reading Protobuf serialized data.

And executing a hiveSQL query statement to read Protobuf serialized data.
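The last two steps might look as follows in a hive session. ADD JAR is hive's statement for adding a jar to the session classpath, which is one possible reading of loading the package into the hive environment; the jar path and table name are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class HiveSessionScript {

    /** Returns the statements to run, in order: load the parser jar, then query the table. */
    static List<String> statements(String jarPath, String table, int limit) {
        return Arrays.asList(
                "ADD JAR " + jarPath,
                "SELECT * FROM " + table + " LIMIT " + limit);
    }
}
```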

The data stored at the bottom layer of the hive data warehouse may fail to be parsed and read because of network problems, parser faults and the like. In order to locate the cause of such a fault as soon as possible and repair it in time, optionally, in one implementation and referring to fig. 2, the method may further include the following on the basis of the foregoing embodiment:

S105: Judge whether information indicating that the data was parsed and read successfully is received within the preset time; if not, execute S106.

S106: Raise a data reading error alarm, and feed back log file information of the process of automatically parsing and reading the Protobuf serialized data.

In this embodiment, a self-feedback step is provided. If, within a preset time period (for example, 10 s) after the system starts to parse and read the stored data, no feedback indicating that the data was parsed and read successfully is received, a fault is deemed to have occurred during the parsing and reading. The log file information of the system operation during this period is then captured in time and fed back to the system, which prevents the fault-related information from being overwritten by subsequent log files. A worker can thus locate the bug from the log file accurately and promptly and repair the fault efficiently, which improves the overall performance of the system.
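Steps S105 and S106 can be sketched with a future that is awaited for a bounded time. The class name ReadWatchdog, the message formats and the log supplier are assumptions for this example; the patent does not describe how the success information or the log content is obtained.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class ReadWatchdog {

    /**
     * Waits up to timeoutMillis for the read to report success (S105).
     * If no success arrives in time, returns an alarm carrying the captured log lines (S106).
     */
    static String awaitOrAlarm(CompletableFuture<String> read, long timeoutMillis,
                               Supplier<List<String>> logTail) {
        try {
            return "OK: " + read.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // No success message within the preset time, or the read itself failed.
            return "ALARM: data read error; log: " + String.join(" | ", logTail.get());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "ALARM: data read error; log: " + String.join(" | ", logTail.get());
        }
    }
}
```

Capturing the log lines at the moment the timeout fires is what keeps the relevant entries from being overwritten later.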

The embodiment of the invention also provides a corresponding implementation device for the data storage method, so that the method has higher practicability. In the following, the data storage device provided by the embodiment of the present invention is introduced, and the data storage device described below and the data storage method described above may be referred to correspondingly.

Referring to fig. 3, fig. 3 is a structural diagram of a data storage device according to an embodiment of the present invention, in an embodiment, the data storage device may include:

a Hive table creating module 301, configured to create a Hive table, where the Hive table is used to read data in a Protobuf structure data storage format stored in an underlying layer.

The data conversion module 302 is configured to generate a corresponding programming language file from the description file of the Protobuf serialized data, and send the corresponding programming language file to the to-be-loaded package.

The analysis mode configuration module 303 is configured to configure a field for analyzing data in the hive table according to an analysis mode corresponding to a preset constructed data analyzer; the data parser is used for matching the hive table with the configuration file through the table mode and the configuration structure, and generating an Object set and a hive result Object set of the hive table structure.

And the data automatic parsing and reading module 304 is used for automatically parsing and reading the Protobuf serialized data by using the hive table based on the data parser.

As a preferred implementation manner of this embodiment, the data parser may include a pattern matcher, an object converter, and a nested traversing device;

the pattern matcher is used for matching the hive table with the configuration file to obtain a table pattern and a configuration structure;

the object converter is used for implementing conversion logic from a Java object to an Object object based on a subclass proxy;

the nested traversing device is used for traversing Protobuf serialized data by using nested objects to generate Java objects.

Optionally, in some embodiments of this embodiment, referring to fig. 4, for example, the apparatus may further include an alarm module 305, configured to alarm a data reading error if the information that the data analysis and reading are successful is not received within a preset time, and feed back log file information in the process of automatically analyzing and reading the Protobuf serialized data.

In some other embodiments, the data automatic analysis reading module may specifically include:

the automatic Object generation submodule is used for generating an Object set of the hive table structure by rewriting initialization functions of the hive data warehouse so as to assemble the pattern matching file and the Object conversion file;

the hive result object set generation submodule is used for generating a hive result object set by assembling nested traversal files by rewriting the data analysis function of the hive data warehouse;

the analysis mode designation submodule is used for generating an executable file package, setting the position of a java file generated by a Protobuf structure definition file in a configuration file, and loading the executable file package into a hive environment variable to designate an inlet for analyzing and reading Protobuf serialized data by using a hive table;

and the reading submodule is used for executing a hiveSQL query statement to read Protobuf serialized data.

The functions of the functional modules of the data storage device according to the embodiment of the present invention may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the relevant description of the foregoing method embodiment, which is not described herein again.

Therefore, the embodiment of the invention solves the problem of parsing Protobuf files stored at the bottom layer of the hive data warehouse with the least development effort, and helps improve the data storage security of the hive data warehouse.

An embodiment of the present invention further provides a data storage device, which may specifically include:

a memory for storing a computer program;

a processor for executing a computer program to implement the steps of the data storage method according to any of the above embodiments.

The embodiment of the present invention further provides a computer-readable storage medium in which a data storage program is stored; when the data storage program is executed by a processor, the steps of the data storage method according to any one of the above embodiments are implemented.

The functions of the functional modules of the computer-readable storage medium according to the embodiment of the present invention may be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process may refer to the related description of the foregoing method embodiment, which is not described herein again.

The embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The data storage method, device, equipment and computer readable storage medium provided by the invention are described in detail above. The principles and embodiments of the present invention have been described herein using specific examples, which are presented only to assist in understanding the method and its core concepts of the present invention. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims

1. A method of storing data, comprising:

the data parser is used for matching the hive table with a configuration file through a table pattern and a configuration structure, and generating an Object set and a hive result Object set of the hive table structure; the construction process of the data parser applied to the Java language environment comprises the following steps:

2. The data storage method according to claim 1, wherein the automatically parsing and reading the Protobuf serialized data with the hive table based on the data parser comprises:

and executing a hiveSQL query statement to read the Protobuf serialized data.

3. The data storage method according to claim 1 or 2, wherein the automatically parsing and reading the Protobuf serialization data by using the hive table based on the data parser comprises:

4. A data storage device, comprising:

the data automatic analysis reading module is used for automatically analyzing and reading the Protobuf serialized data by utilizing the hive table based on the data analyzer;

the data parser comprises a pattern matcher, an object converter and a nested traversing device;

5. The data storage device of claim 4, wherein the data automated parsing reading module comprises:

the automatic Object generation submodule is used for generating an Object set of the hive table structure by rewriting initialization functions of the hive data warehouse to realize assembling a pattern matching file and an Object conversion file;

the hive result object set generation submodule is used for generating a hive result object set by rewriting a data analysis function of the hive data warehouse to assemble and nest the traversal file;

the analysis mode designation submodule is used for generating an executable file packet and setting the position of a java file generated by a Protobuf structure definition file in the configuration file so as to load the executable file packet into a hive environment variable and designate an inlet for analyzing and reading the Protobuf serialized data by using the hive table;

and the reading sub-module is used for executing a hiveSQL query statement to read the Protobuf serialized data.

6. The data storage device of claim 4 or 5, further comprising an alarm module, configured to perform an alarm of a data reading error if no information that data parsing and reading are successful is received within a preset time, and feed back log file information of the process of automatically parsing and reading the Protobuf serialized data.

7. A data storage device comprising a processor for implementing the steps of the data storage method of any one of claims 1 to 3 when executing a computer program stored in a memory.

8. A computer-readable storage medium, characterized in that a data storage program is stored on the computer-readable storage medium, which data storage program, when executed by a processor, carries out the steps of the data storage method according to any one of claims 1 to 3.