CN114625730A - Data storage method and device - Google Patents

Info

Publication number
CN114625730A
Authority
CN
China
Prior art keywords
data
database
storage
preset
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011474676.6A
Other languages
Chinese (zh)
Inventor
黄耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Guoshuang Software Co ltd
Original Assignee
Suzhou Guoshuang Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Guoshuang Software Co ltd
Priority to CN202011474676.6A
Publication of CN114625730A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data storage method and a data storage device, relates to the field of data processing, and mainly aims to meet the need to operate on or analyze the detailed content of data during data storage. The method comprises the following steps: acquiring data to be put in storage through a preset data stream processing framework; processing the data to be put in storage through the preset data stream processing framework to obtain processed data, wherein the processed data is data that a user can analyze and read based on different dimensions; and storing the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving the excess data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases that support random reading and writing and analysis of their data in different dimensions. The invention is used in the data storage process.

Description

Data storage method and device
Technical Field
The present invention relates to the field of data processing, and in particular, to a method and an apparatus for storing data.
Background
Conventionally, raw data generated by a data source cannot be read and analyzed directly by a user because of its form. The raw data generally has to be processed and stored first, and when a data request from a user is detected, the processed data is returned to the requesting end. As data volumes grow, both the number of data sources and the amount of raw data generated by each source increase greatly. In this situation, the prior art usually processes and stores the raw data in batches, once every hour or every day, to serve subsequent data requests. However, newly generated raw data may need to be read or operated on before the next batch run. For this case, the prior art usually just aggregates the data that has not yet been processed and uses the resulting summary information to meet the user's needs.
At present, therefore, raw data is generally processed and stored periodically during data storage, and unprocessed data is only summarized. In practical applications, however, a user may need to operate on or analyze a specific detail of the data. Summary information is only an aggregate result and cannot support operations on specific details the way processed data can. That is, when the data the user needs to operate on or analyze is still unprocessed, the prior-art storage approach can hardly meet the user's needs. How to support operations on and analysis of the detailed content of data during data storage has therefore become an urgent problem in the field.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for storing data, and a main objective of the present invention is to meet the requirement of operating or analyzing detailed content of data during data storage.
In order to solve the above technical problem, in a first aspect, the present invention provides a data storage method, including:
acquiring data to be put in storage through a preset data stream processing framework;
processing the data to be put in storage through the preset data stream processing framework to obtain processed data, wherein the processed data is data that a user can analyze and read based on different dimensions;
and storing the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving the excess data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases that support random reading and writing and analysis of their data in different dimensions.
Optionally, after storing the processed data in the target database, the method further includes:
when a data operation request is detected, acquiring the data corresponding to the data operation request from the target database through a preset data query system.
Optionally, before the data operation request is detected, the method further includes:
constructing a data directory for the data stored in the first database and the second database, wherein the data directory comprises an identifier of each piece of processed data and its corresponding storage position, and the storage positions comprise the first database and the second database;
the acquiring, when the data operation request is detected, of the data corresponding to the data operation request from the target database through the preset data query system includes:
when a data operation request is detected, determining, from the data directory and according to the data operation request, the storage position of the processed data corresponding to the data operation request;
and acquiring the corresponding processed data from the first database or the second database according to the storage position.
Optionally, the processing of the data to be put in storage through the preset data stream processing framework to obtain the processed data includes:
executing a data parsing operation, a data cleaning operation and/or a data filling operation on the data to be put in storage through the preset data stream processing framework to obtain the processed data.
Optionally, the storing of the processed data in the target database includes:
judging whether the data volume stored in the first database exceeds the preset threshold;
if not, storing the processed data in the first database and marking the warehousing time;
if so, transferring the processed data in the first database whose warehousing duration exceeds a preset duration to the second database, and storing the currently received processed data in the first database, wherein the warehousing duration represents how long the processed data has been stored in the first database and is obtained from the difference between the warehousing time and the current time.
Optionally, the preset data stream processing framework includes a Flink framework, the first database includes a Kudu database, the second database includes an HDFS database, and the preset data query system includes an Impala query system.
In a second aspect, an embodiment of the present invention further provides a data storage device, including:
the first acquisition unit is used for acquiring data to be put in storage through a preset data stream processing framework;
the processing unit is used for processing the data to be put in storage through the preset data stream processing framework to obtain processed data, wherein the processed data is data that a user can analyze and read based on different dimensions;
and the storage unit is used for storing the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving the excess data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases that support random reading and writing and analysis of their data in different dimensions.
Optionally, the apparatus further comprises:
and the second acquisition unit is used for acquiring data corresponding to the data operation request from the target database through a preset data query system when the data operation request is detected.
Optionally, the apparatus further comprises:
a constructing unit, configured to construct a data directory for data stored in the first database and the second database, where the data directory includes an identifier of each processed data and a corresponding storage location, and the storage location includes the first database and the second database;
the second acquisition unit includes:
the determining module is used for determining the storage position of the processed data corresponding to the data operation request from the data directory according to the data operation request when the data operation request is detected;
and the acquisition module is used for acquiring the corresponding processed data from the first database or the second database according to the storage position.
Optionally, the processing unit is specifically configured to perform a data parsing operation, a data cleaning operation, and/or a data filling operation on the data to be put in storage through the preset data stream processing framework, so as to obtain the processed data.
Optionally, the storage unit includes:
the judging module is used for judging whether the data volume stored in the first database exceeds a preset threshold value or not;
the storage module is used for storing the processed data in the first database and marking the warehousing time if the data volume stored in the first database is judged not to exceed a preset threshold;
and the transmission module is used for transferring the processed data with the warehousing duration exceeding the preset duration in the first database to the second database and storing the currently received processed data into the first database if the data volume stored in the first database is judged to exceed the preset threshold, wherein the warehousing duration is used for representing the storage duration of the processed data stored in the first database, and the warehousing duration is obtained based on the difference between the warehousing time and the current time.
Optionally, the preset data stream processing framework includes a Flink framework, the first database includes a Kudu database, the second database includes an HDFS database, and the preset data query system includes an Impala query system.
In order to achieve the above object, according to a third aspect of the present invention, there is provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the above data storage method.
In order to achieve the above object, according to a fourth aspect of the present invention, there is provided an apparatus comprising at least one processor, at least one memory connected with the processor, and a bus; the processor and the memory communicate with each other through the bus; the processor is used for calling the program instructions in the memory, and the program instructions execute the above data storage method when run.
By means of the above technical scheme, the data storage method and device provided by the invention address the following problem of the prior art: during data storage, data that has not yet reached the periodic processing time remains unprocessed, so the user's need to analyze and operate on the detailed content of that data is difficult to meet. The invention acquires the data to be put in storage through a preset data stream processing framework; processes the data to be put in storage through the preset data stream processing framework to obtain processed data; and stores the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving the excess data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases that support random reading and writing and analysis of their data in different dimensions, thereby realizing the data storage function. In the above scheme, the processed data is data that a user can analyze and read based on different dimensions, so all data stored in the target database can be analyzed at the level of detailed content. The method of the invention therefore ensures that all stored data can meet the user's operation and analysis needs, and avoids the prior-art problem that unprocessed data makes those needs difficult to meet.
Meanwhile, because the processed data is stored in the first database or the second database, and both databases support random reading and writing and analysis of their data in different dimensions, the stored data can be read, written and analyzed in different dimensions quickly according to the user's needs. In addition to analyzing and operating on the detailed content of the stored data, the user can thus also run queries and operations that combine arbitrary dimensions.
The above is only an overview of the technical solutions of the present invention; it is given so that the technical means of the invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features and advantages of the invention are easier to understand.
Drawings
Various additional advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart of a data storage method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another data storage method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a data storage device according to an embodiment of the present invention;
FIG. 4 is a block diagram of another data storage device according to an embodiment of the present invention;
FIG. 5 is a block diagram of a device for data storage according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In order to meet the requirement of performing operation or analysis on detailed content of data in the data storage process, an embodiment of the present invention provides a data storage method, as shown in fig. 1, where the method includes:
101. Acquiring data to be put in storage through a preset data stream processing framework.
In the embodiment of the present invention, the preset data stream processing framework may be understood as a framework that supports data sources of multiple formats and can acquire data from them. Because data sources generate data continuously and in real time, this embodiment uses the preset data stream processing framework to acquire the data to be put in storage continuously and in real time as well. This ensures that, while the data sources keep generating data, the method keeps acquiring it in real time for the subsequent steps.
102. Processing the data to be put in storage through the preset data stream processing framework to obtain processed data.
The processed data is data that a user can analyze and read based on different dimensions. Because the data to be put in storage is data as generated by a data source, that is, data that a user cannot analyze directly, this step processes it with the preset data stream processing framework to obtain data that the user can analyze and read based on different dimensions, namely the processed data. Specifically, the processing includes, but is not limited to, data parsing operations, data cleaning operations, data filling operations and other related operations. The processing method is not specifically limited here and may be chosen according to the processing functions of the preset data stream processing framework.
103. Storing the processed data in a target database.
The target database comprises a first database and a second database, the second database is used for receiving the excess data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases that support random reading and writing and analysis of their data in different dimensions.
In existing data storage processes, data is usually stored in a single preset database. In practical applications, however, the read and write speed of some databases drops when the stored data volume is large. For this reason, in this embodiment the processed data is stored in a target database comprising a first database and a second database, and when the data volume of the first database exceeds the preset threshold, the data with a long warehousing duration is transferred to the second database. This avoids the drop in read and write speed that would result from using only the first database beyond a certain data volume. It improves the efficiency of storage and also the efficiency of acquiring data from the first database in subsequent operations such as data reading.
Meanwhile, in this embodiment, both the first database and the second database in the target database support random reading and writing and analysis of their data in different dimensions. On top of good read and write efficiency, the target database can therefore support analysis of and operations on the processed data in multiple dimensions, which meets the user's need to perform related operations on the stored data.
The data storage method provided by this embodiment addresses the following problem of the prior art: during data storage, data that has not yet reached the periodic processing time remains unprocessed, so the user's need to analyze and operate on the detailed content of that data is difficult to meet. The method acquires the data to be put in storage through a preset data stream processing framework; processes the data to be put in storage through the preset data stream processing framework to obtain processed data; and stores the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving the excess data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases that support random reading and writing and analysis of their data in different dimensions, thereby realizing the data storage function. In the above scheme, the processed data is data that a user can analyze and read based on different dimensions, so all data stored in the target database can be analyzed at the level of detailed content. The method therefore ensures that all stored data can meet the user's operation and analysis needs, and avoids the prior-art problem that unprocessed data makes those needs difficult to meet.
Meanwhile, because the processed data is stored in the first database or the second database, and both databases support random reading and writing and analysis of their data in different dimensions, the stored data can be read, written and analyzed in different dimensions quickly according to the user's needs. In addition to analyzing and operating on the detailed content of the stored data, the user can thus also run queries and operations that combine arbitrary dimensions.
Further, as a refinement and an extension of the embodiment shown in fig. 1, an embodiment of the present invention further provides another data storage method, as shown in fig. 2, the specific steps include:
201. Acquiring data to be put in storage through a preset data stream processing framework.
The preset data stream processing framework comprises the Flink framework. Flink, in full Apache Flink, is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computation at in-memory speed and at any scale. It supports high-throughput, low-latency, high-performance stream processing. At run time, a Flink program is mapped to streaming dataflows; each Flink dataflow starts with one or more sources (data input, e.g. a message queue or file system) and ends with one or more sinks (data output, e.g. a message queue, file system or database). Flink can apply any number of transformations to the streams, and these can be arranged into a directed acyclic dataflow graph, which allows an application to branch and merge dataflows. Therefore, in this embodiment, the Flink framework may be used as the preset data stream processing framework to acquire the data to be put in storage. Because the Flink framework is built around data streams, data can be acquired promptly even when multiple data sources produce data to be put in storage; acquisition does not have to queue, which improves acquisition efficiency.
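The source, transformation and sink structure described above can be sketched in plain Python. This is only an illustration of the dataflow shape and does not use the Flink API; the names `source`, `parse` and `run`, the `key=value` record format and the list-based sinks are all hypothetical.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def source(records: Iterable[str]) -> Iterator[str]:
    """Hypothetical source: yields raw records one at a time, like a stream."""
    for record in records:
        yield record


def parse(stream: Iterator[str]) -> Iterator[Dict[str, str]]:
    """Transformation: turn 'key=value;key=value' text into a dict."""
    for raw in stream:
        yield dict(pair.split("=", 1) for pair in raw.split(";") if "=" in pair)


def run(stream: Iterator[Dict[str, str]],
        sinks: List[Callable[[Dict[str, str]], None]]) -> None:
    """Branch: deliver every element of the stream to each sink."""
    for element in stream:
        for sink in sinks:
            sink(element)


all_events: List[Dict[str, str]] = []     # sink 1: keeps everything
error_events: List[Dict[str, str]] = []   # sink 2: keeps only errors

run(
    parse(source(["level=info;msg=started", "level=error;msg=disk full"])),
    sinks=[
        all_events.append,
        lambda e: error_events.append(e) if e.get("level") == "error" else None,
    ],
)
```

Each element flows from the source through the transformation and is then delivered to both sinks, which mirrors a dataflow that branches after a shared transformation.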
202. Processing the data to be put in storage through the preset data stream processing framework to obtain processed data.
The processed data is data that a user can analyze and read based on different dimensions.
As described in step 201, this embodiment uses the Flink framework, so in this step the data to be put in storage can be processed in real time as it is acquired to obtain the processed data.
Specifically, the step may include: executing a data parsing operation, a data cleaning operation and/or a data filling operation on the data to be put in storage through the preset data stream processing framework to obtain the processed data.
Because the data to be put in storage is disordered data generated by the data sources and may contain redundant data, the processing in this step may include parsing, cleaning, filling and the like, that is, a data parsing operation, a data cleaning operation and/or a data filling operation. This processing converts the data to be put in storage from raw data into data that a user can analyze and read based on different dimensions, namely the processed data. The specific data cleaning, data parsing and data filling operations may follow processing methods in the prior art and are not described here again.
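As a minimal sketch of what parsing, cleaning and filling could look like for one record stream, the following Python code assumes raw records arrive as JSON lines. The field names (`id`, `region`, `amount`), the default values and the deduplication rule are assumptions made for illustration; the source leaves the concrete operations to the prior art.

```python
import json
from typing import Any, Dict, Iterable, Iterator, Optional

# Hypothetical defaults used by the filling step.
DEFAULTS: Dict[str, Any] = {"region": "unknown", "amount": 0.0}


def parse_record(raw: str) -> Optional[Dict[str, Any]]:
    """Parsing: decode one raw JSON line; return None if it is malformed."""
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return value if isinstance(value, dict) else None


def clean_record(record: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Cleaning: drop records without an id and strip stray whitespace."""
    if not record.get("id"):
        return None
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def fill_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Filling: supply defaults for fields that are missing or null."""
    filled = dict(record)
    for name, default in DEFAULTS.items():
        if filled.get(name) is None:
            filled[name] = default
    return filled


def process(raw_records: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Run parse, clean and fill on each record, skipping redundant ids."""
    seen_ids = set()
    for raw in raw_records:
        parsed = parse_record(raw)
        cleaned = clean_record(parsed) if parsed is not None else None
        if cleaned is None or cleaned["id"] in seen_ids:
            continue  # malformed, incomplete, or redundant record
        seen_ids.add(cleaned["id"])
        yield fill_record(cleaned)
```

Because `process` is a generator, records are handled one at a time as they arrive, in keeping with the real-time processing described in this step.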
203. Storing the processed data in a target database.
The target database comprises a first database and a second database, the second database is used for receiving the excess data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases that support random reading and writing and analysis of their data in different dimensions. Specifically, the first database includes a Kudu database, and the second database includes an HDFS database.
The data model of the Kudu database is similar to that of a traditional relational database. A Kudu cluster consists of a number of tables, and each table consists of a number of fields. A table must specify a primary key made up of one or more fields. Every field in a Kudu table is strongly typed, rather than all fields being treated as strings as in some other databases; the advantage is that different types of data can be encoded differently, which saves space. Meanwhile, because Kudu is intended for online analysis, the data types in the Kudu database are also friendlier to downstream analysis tools. The HDFS database, in full the Hadoop Distributed File System, is a distributed file storage system designed to run on commodity hardware. HDFS adopts a master-slave architecture: an HDFS cluster consists of one master node and a number of slave nodes. The master node acts as the main controller and manages the file system namespace and client access to files, while the slave nodes in the cluster manage the stored data.
Specifically, the step may include:
judging whether the data volume stored in the first database exceeds the preset threshold;
if not, storing the processed data in the first database and marking the warehousing time;
if so, transferring the processed data in the first database whose warehousing duration exceeds a preset duration to the second database, and storing the currently received processed data in the first database, wherein the warehousing duration represents how long the processed data has been stored in the first database and is obtained from the difference between the warehousing time and the current time.
As the above steps show, in this embodiment processed data that needs to be stored is stored preferentially in the Kudu database, that is, the first database. This ensures that when a user later accesses or operates on this data, fast access and online multi-dimensional query and analysis are possible thanks to the characteristics of the Kudu database. When the data volume stored in the Kudu database exceeds the preset threshold, part of the data in the Kudu database is transferred to the HDFS database. This part is the data that has been stored in the Kudu database for a long time, which may be called "old data". Once the "old data" has been transferred to the HDFS database, the data volume in the Kudu database stays within a stable range, which avoids long operation times caused by too much data in the Kudu database and preserves the efficiency of subsequent reading, analysis and operation. In addition, because the processed data whose warehousing duration exceeds the preset duration is transferred to the second database, namely the HDFS database, data that has been stored longer ends up in the HDFS database, and the need to store large volumes of data can still be met when there is a large amount of long-stored processed data.
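The threshold check and the age-based transfer described above can be sketched with an in-memory model. The class `TieredStore`, its parameter names and the use of a record count as the "data volume" are assumptions for illustration; two dictionaries stand in for the Kudu and HDFS databases, and no real database client is used.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class TieredStore:
    """Hypothetical in-memory model of the first and second database."""
    max_first_records: int   # stand-in for the preset threshold on data volume
    max_age_seconds: float   # stand-in for the preset duration
    # first database: id -> (warehousing time, data)
    first: Dict[str, Tuple[float, Any]] = field(default_factory=dict)
    # second database: id -> data
    second: Dict[str, Any] = field(default_factory=dict)

    def store(self, record_id: str, data: Any, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        if len(self.first) > self.max_first_records:
            self._migrate_old(now)
        # New data always lands in the first database, stamped with its
        # warehousing time.
        self.first[record_id] = (now, data)

    def _migrate_old(self, now: float) -> None:
        """Move records whose warehousing duration exceeds the preset duration."""
        for record_id, (stored_at, data) in list(self.first.items()):
            if now - stored_at > self.max_age_seconds:
                self.second[record_id] = data
                del self.first[record_id]


store = TieredStore(max_first_records=2, max_age_seconds=10)
for record_id, t in [("a", 0.0), ("b", 1.0), ("c", 2.0), ("d", 11.5)]:
    store.store(record_id, {"id": record_id}, now=t)
```

In this run, the fourth write finds three records in the first store, which exceeds the threshold of two, so the records older than 10 seconds (`a` and `b`) move to the second store while `c` and the new record `d` stay in the first.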
In addition, when data is transferred from the Kudu database to the HDFS database, the processed data exceeding the preset duration may first be copied and the copy transferred to the HDFS database, in order to preserve data integrity and avoid data loss caused by a failed transfer. Only once the data stored in HDFS is confirmed to be consistent with the data that was sent is the processed data exceeding the preset duration deleted from Kudu.
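A minimal sketch of this copy, verify, then delete order follows. The function names `checksum` and `migrate`, the SHA-256 comparison and the dictionary stand-ins for the two databases are assumptions; the source does not say how consistency is checked.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, List


def checksum(data: Any) -> str:
    """Stable digest of a JSON-serialisable record, used to compare copies."""
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def migrate(first: Dict[str, Any], second: Dict[str, Any],
            record_ids: Iterable[str]) -> List[str]:
    """Copy records to `second`, verify each copy, then delete it from `first`.

    Returns the ids that failed verification; those records stay in `first`.
    """
    failed: List[str] = []
    for record_id in list(record_ids):
        original = first[record_id]
        second[record_id] = json.loads(json.dumps(original))  # copy, not move
        if checksum(second[record_id]) == checksum(original):
            del first[record_id]     # delete only after the copy is verified
        else:
            del second[record_id]    # discard the bad copy, keep the original
            failed.append(record_id)
    return failed
```

With in-memory dictionaries the verification always succeeds; the point of the sketch is the ordering, in which the original is never deleted before the copy has been confirmed.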
For example, suppose the amount of data stored in the Kudu database exceeds 80% of a preset threshold and new processed data is detected that needs to be stored. According to the method in this step, the data in the Kudu database is examined by warehousing duration to determine which processed data has a warehousing duration exceeding the preset duration. If the preset duration is 1 day, the processed data whose warehousing duration exceeds 1 day is transferred from the Kudu database to the HDFS database, which reduces the amount of data in the Kudu database, and the newly received processed data is stored directly in the Kudu database.
It should be noted that, in practical applications, the preset duration may be tuned to prevent the data volume of the Kudu database from still exceeding the preset threshold after the currently received processed data has been stored. Because data is stored in the database in chronological order, flexibly setting the preset duration adjusts how much data remains in the Kudu database, which avoids exceeding the preset threshold.
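The storage decision of this step, and the effect of tuning the preset duration, can be illustrated with a short sketch. It assumes that the data volume is measured as a record count and that each store is a dictionary mapping an identifier to a `(payload, warehousing_time)` pair; the `store` function and its parameters are hypothetical names introduced for the example.

```python
def store(first_db: dict, second_db: dict, record_id: str, payload: bytes,
          now: float, threshold: int, preset_duration: float) -> None:
    """Store a new record in first_db, moving old records to second_db if needed."""
    if len(first_db) > threshold:
        # Warehousing duration = current time - warehousing time.
        expired = [rid for rid, (_, stored_at) in first_db.items()
                   if now - stored_at > preset_duration]
        for rid in expired:
            second_db[rid] = first_db.pop(rid)
    # The newly received record always goes to the first database,
    # marked with its warehousing time.
    first_db[record_id] = (payload, now)
```

With three records aged 100, 50, and 10 time units, a preset duration of 30 moves two of them to the second store, whereas a preset duration of 60 moves only one. A shorter preset duration therefore leaves less data in the first store.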
204. Construct a data directory for the data stored in the first database and the second database.
The data directory comprises an identifier of each piece of processed data and its corresponding storage location, and the storage locations comprise the first database and the second database.
In this embodiment, data is not simply left untouched after being stored; related operations, including analysis of and operations on the detailed content of the data, are performed according to the user's needs. Therefore, after the processed data is stored in the target database in step 203, a corresponding directory needs to be established for the data, so that data can be retrieved through the directory when a data request arrives later.
In actual operation, the way the data directory is constructed can be matched to the system later used to query the data. For example, when this embodiment employs the Impala query system, the data directory may be constructed with the VIEW tool. By constructing a data directory covering all the data in the first database and the second database, the location of the data can be determined directly from the data directory when a data request arrives later. This lays a foundation for fast data retrieval: there is no need to traverse both databases, which saves lookup time.
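As a rough illustration of what such a directory holds, the sketch below builds a mapping from each record identifier to its storage location. It is an in-memory stand-in only: the embodiment mentions building the directory with the VIEW tool of Impala, whereas the `Location` enum and the `build_directory` function here are hypothetical names used purely for illustration.

```python
from enum import Enum


class Location(Enum):
    FIRST = "first_database"
    SECOND = "second_database"


def build_directory(first_db: dict, second_db: dict) -> dict:
    """Map every record identifier to the database that currently holds it."""
    directory = {rid: Location.SECOND for rid in second_db}
    # If an identifier is present in both stores (for example, during a
    # transfer), this sketch treats the copy in the first database as current.
    directory.update({rid: Location.FIRST for rid in first_db})
    return directory
```

A single lookup in the resulting mapping then tells a query which store to read, instead of searching both.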
205. When a data operation request is detected, acquire the data corresponding to the data operation request from the target database through a preset data query system.
In a specific implementation, the preset data query system in this embodiment may include the Impala query system. Impala is a query system developed by Cloudera that provides SQL (Structured Query Language) semantics and can query petabyte-scale data stored in a database. Compared with other query systems or tools, the most notable characteristic of the Impala query system is its faster query speed.
Specifically, because a data directory has already been constructed in the preceding step for all the data stored in the first database and the second database, this step can be executed as follows:
First, when a data operation request is detected, the storage location of the processed data corresponding to the data operation request is determined from the data directory according to the data operation request.
Then, according to the storage location, the corresponding processed data is acquired from the first database or the second database.
Through this process, the processed data corresponding to the data operation request is looked up in the data directory, and the data is then acquired from the location recorded for it in the directory, so the required data can be determined quickly. Meanwhile, because the data acquisition is executed by the Impala query system, the required data can be acquired more quickly thanks to the characteristics of Impala.
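The two-stage lookup described in this step (resolve the location in the directory, then read from the indicated database) can be sketched as follows. The directory is assumed to be a plain mapping from identifier to a location string, and the `fetch` function and the location labels are hypothetical; in the embodiment the actual retrieval is performed by the Impala query system.

```python
FIRST = "first_database"
SECOND = "second_database"


def fetch(record_id: str, directory: dict, first_db: dict, second_db: dict):
    """Resolve the storage location from the directory, then read from that store."""
    location = directory.get(record_id)
    if location == FIRST:
        return first_db[record_id]
    if location == SECOND:
        return second_db[record_id]
    # The identifier is not listed in the directory at all.
    raise KeyError(f"no directory entry for {record_id!r}")
```

Only one store is read per request, which is the point of consulting the directory first.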
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention further provides a data storage apparatus for implementing the method shown in fig. 1. This apparatus embodiment corresponds to the foregoing method embodiment; for ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the apparatus in this embodiment can implement everything in the method embodiment. As shown in fig. 3, the apparatus includes a first obtaining unit 31, a processing unit 32, and a storage unit 33, wherein
the first obtaining unit 31 may be configured to obtain data to be warehoused through a preset data stream processing framework;
the processing unit 32 may be configured to process the data to be warehoused, obtained by the first obtaining unit 31, through the preset data stream processing framework to obtain processed data, where the processed data is data for a user to analyze and read based on different dimensions;
the storage unit 33 may be configured to store the processed data obtained by the processing unit 32 in a target database, where the target database includes a first database and a second database, the second database may be configured to receive excess data when the amount of data stored in the first database exceeds a preset threshold, and both the first database and the second database are databases that support random reading and writing and analysis of their data across different dimensions.
Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention further provides a data storage apparatus for implementing the method shown in fig. 2. This apparatus embodiment corresponds to the foregoing method embodiment; for ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the apparatus in this embodiment can implement everything in the method embodiment. As shown in fig. 4, the apparatus includes a first obtaining unit 41, a processing unit 42, and a storage unit 43, wherein
the first obtaining unit 41 may be configured to obtain data to be warehoused through a preset data stream processing framework;
the processing unit 42 may be configured to process the data to be warehoused, obtained by the first obtaining unit 41, through the preset data stream processing framework to obtain processed data, where the processed data is data for a user to analyze and read based on different dimensions;
the storage unit 43 may be configured to store the processed data obtained by the processing unit 42 in a target database, where the target database includes a first database and a second database, the second database may be configured to receive excess data when the amount of data stored in the first database exceeds a preset threshold, and both the first database and the second database are databases that support random reading and writing and analysis of their data across different dimensions.
Further, the apparatus further comprises:
the second obtaining unit 44 may be configured to, when a data operation request is detected, obtain the data corresponding to the data operation request, through a preset data query system, from the target database in which the storage unit 43 has stored the processed data.
Further, the apparatus further comprises:
a constructing unit 45, configured to construct a data directory for the data that the storage unit 43 has stored in the first database and the second database, where the data directory includes an identifier of each piece of processed data and its corresponding storage location, and the storage locations include the first database and the second database;
the second obtaining unit 44 includes:
the determining module 441 may be configured to, when a data operation request is detected, determine, according to the data operation request, a storage location of the processed data corresponding to the data operation request from the data directory;
the obtaining module 442 may be configured to obtain the corresponding processed data from the first database or the second database according to the storage location determined by the determining module 441.
Further, the processing unit 42 may be specifically configured to perform a data parsing operation, a data cleaning operation, and/or a data filling operation on the data to be warehoused through a preset data stream processing framework, so as to obtain the processed data.
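To make the three processing operations concrete, the sketch below applies parsing, cleaning, and filling to a single raw record. It is a simplified stand-in written in plain Python rather than in a stream processing framework such as Flink, and the JSON input format, the field names (`id`, `region`, `value`), and the default values are all assumptions made for the example.

```python
import json
from typing import Optional

# Hypothetical defaults used by the filling step.
DEFAULTS = {"region": "unknown", "value": 0}


def process(raw: str) -> Optional[dict]:
    """Parse, clean, and fill one raw record; return None if it must be discarded."""
    # Parsing: turn the raw text into a structured record.
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Cleaning: discard records without an identifier, trim string fields.
    if not isinstance(record, dict) or not record.get("id"):
        return None
    cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}
    # Filling: supply defaults for missing or empty fields.
    for field, default in DEFAULTS.items():
        if cleaned.get(field) in (None, ""):
            cleaned[field] = default
    return cleaned
```

Records that pass through these steps have a predictable set of fields, which is what allows them to be analyzed along different dimensions after storage.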
Further, the storage unit 43 includes:
the determining module 431 may be configured to determine whether the amount of data stored in the first database exceeds a preset threshold;
the storage module 432 may be configured to, if the determining module 431 determines that the amount of data stored in the first database does not exceed a preset threshold, store the processed data in the first database, and mark a warehousing time;
the transmission module 433 may be configured to, if the determination module 431 determines that the amount of the data stored in the first database exceeds a preset threshold, transfer the processed data in the first database, of which the warehousing duration exceeds a preset duration, to the second database, and store the currently received processed data in the first database, where the warehousing duration may be used to represent a time period for storing the processed data in the first database, and the warehousing duration is obtained based on a difference between the warehousing time and the current time.
Further, the preset data stream processing framework comprises a Flink framework, the first database comprises a Kudu database, the second database comprises an HDFS database, and the preset data query system comprises an Impala query system.
With the above technical solutions, embodiments of the present invention provide a method and an apparatus for storing data.
In the prior art, the stored data may include data that has not been processed, or that is processed only at regular intervals, so that a user's need to analyze and operate on the detailed content of that data is difficult to meet. To address this problem, the present invention acquires data to be warehoused through a preset data stream processing framework; processes the data to be warehoused through the preset data stream processing framework to obtain processed data; and stores the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving excess data when the data volume stored in the first database exceeds a preset threshold, and both the first database and the second database are databases supporting random reading and writing and analysis of their data across different dimensions, thereby realizing the data storage function. In the above scheme, the processed data is data for a user to analyze and read based on different dimensions, so all the data stored in the target database supports analysis of its detailed content. The method of the invention therefore ensures that all stored data can meet the user's operation and analysis needs, and avoids the prior-art problem that unprocessed data makes it difficult to meet the user's need to operate on and analyze the detailed content of the data.
Meanwhile, because the processed data is stored in the first database or the second database, both of which support random reading and writing and analysis of their data across different dimensions, the method ensures that the stored data can be quickly read, written, and analyzed in different dimensions according to the user's needs. Thus, in addition to analyzing and operating on the detailed content of the stored data, the user can also run queries and operations that combine arbitrary dimensions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting the kernel parameters the data storage process is made to satisfy the need to operate on or analyze the detailed content of the data.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the above data storage method when executed by a processor.
An embodiment of the present invention provides a processor for running a program, wherein the program executes the above data storage method when running.
An embodiment of the present invention provides an apparatus 50, as shown in fig. 5, the apparatus includes at least one processor 501, at least one memory 502 connected to the processor, and a bus 503; the processor 501 and the memory 502 complete communication with each other through the bus 503; the processor 501 is used for calling program instructions in the memory to execute the above-mentioned data storage method.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: acquiring data to be warehoused through a preset data stream processing framework; processing the data to be warehoused through the preset data stream processing framework to obtain processed data, wherein the processed data is data for a user to analyze and read based on different dimensions; and storing the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving excess data when the data volume stored in the first database exceeds a preset threshold, and both the first database and the second database are databases supporting random reading and writing and analysis of their data across different dimensions.
Further, after the storing the processed data in the target database, the method further comprises:
and when the data operation request is detected, acquiring data corresponding to the data operation request from the target database through a preset data query system.
Further, before the data operation request is detected, the method further includes:
constructing a data directory for the data stored in the first database and the second database, wherein the data directory comprises an identifier of each processed data and a corresponding storage position, and the storage positions comprise the first database and the second database;
when the data operation request is detected, acquiring data corresponding to the data operation request from the target database through a preset data query system, including:
when a data operation request is detected, determining a storage position of the processed data corresponding to the data operation request from the data directory according to the data operation request;
and acquiring the corresponding processed data from the first database or the second database according to the storage position.
Further, the processing of the data to be warehoused through a preset data stream processing framework to obtain processed data includes:
performing a data parsing operation, a data cleaning operation, and/or a data filling operation on the data to be warehoused through the preset data stream processing framework to obtain the processed data.
Further, the storing the processed data in a target database includes:
judging whether the data volume stored in the first database exceeds a preset threshold value or not;
if not, storing the processed data in the first database and marking the warehousing time;
if so, transferring the processed data whose warehousing duration exceeds the preset duration from the first database to the second database, and storing the currently received processed data in the first database, wherein the warehousing duration represents how long the processed data has been stored in the first database and is obtained from the difference between the warehousing time and the current time.
Further, the preset data stream processing framework comprises a Flink framework, the first database comprises a Kudu database, the second database comprises an HDFS database, and the preset data query system comprises an Impala query system.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a device includes one or more processors (CPUs), memory, and a bus. The device may also include input/output interfaces, network interfaces, and the like.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip. The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method for storing data, comprising:
acquiring data to be warehoused through a preset data stream processing framework;
processing the data to be warehoused through the preset data stream processing framework to obtain processed data, wherein the processed data is data for a user to analyze and read based on different dimensions;
and storing the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving excessive data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases supporting random reading and writing and different-dimension analysis of the data in the databases.
2. The method of claim 1, wherein after said storing said processed data in a target database, said method further comprises:
and when the data operation request is detected, acquiring data corresponding to the data operation request from the target database through a preset data query system.
3. The method of claim 2, wherein prior to said detecting a data operation request, the method further comprises:
constructing a data directory for the data stored in the first database and the second database, wherein the data directory comprises an identifier of each processed data and a corresponding storage position, and the storage positions comprise the first database and the second database;
when the data operation request is detected, acquiring data corresponding to the data operation request from the target database through a preset data query system, including:
when a data operation request is detected, determining a storage position of the processed data corresponding to the data operation request from the data directory according to the data operation request;
and acquiring the corresponding processed data from the first database or the second database according to the storage position.
4. The method according to claim 1, wherein the processing the data to be put into the database through a preset data stream processing framework to obtain processed data comprises:
and performing a data parsing operation, a data cleaning operation, and/or a data filling operation on the data to be warehoused through the preset data stream processing framework to obtain the processed data.
5. The method according to any one of claims 1-4, wherein the storing the processed data in a target database comprises:
judging whether the data volume stored in the first database exceeds a preset threshold value or not;
if not, storing the processed data in the first database and marking the warehousing time;
if so, transferring the processed data whose warehousing duration exceeds the preset duration from the first database to the second database, and storing the currently received processed data in the first database, wherein the warehousing duration represents how long the processed data has been stored in the first database and is obtained from the difference between the warehousing time and the current time.
6. The method of claim 5, wherein the preset data stream processing framework comprises a Flink framework, the first database comprises a Kudu database, the second database comprises an HDFS database, and the preset data query system comprises an Impala query system.
7. An apparatus for storing data, comprising:
the first acquisition unit is used for acquiring data to be warehoused through a preset data stream processing framework;
the processing unit is used for processing the data to be warehoused through the preset data stream processing framework to obtain processed data, wherein the processed data is data for a user to analyze and read based on different dimensions;
and the storage unit is used for storing the processed data in a target database, wherein the target database comprises a first database and a second database, the second database is used for receiving excessive data when the data volume stored in the first database exceeds a preset threshold, and the first database and the second database are both databases supporting random reading and writing and different dimension analysis of the data in the databases.
8. The apparatus of claim 7, further comprising:
and the second acquisition unit is used for acquiring data corresponding to the data operation request from the target database through a preset data query system when the data operation request is detected.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the data storage method of any one of claims 1 to 6.
10. An apparatus comprising at least one processor, and at least one memory and a bus connected to the processor; the processor and the memory communicate with each other through the bus; the processor is configured to call program instructions in the memory to perform the data storage method of any one of claims 1 to 6.
CN202011474676.6A 2020-12-14 2020-12-14 Data storage method and device Pending CN114625730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011474676.6A CN114625730A (en) 2020-12-14 2020-12-14 Data storage method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011474676.6A CN114625730A (en) 2020-12-14 2020-12-14 Data storage method and device

Publications (1)

Publication Number Publication Date
CN114625730A true CN114625730A (en) 2022-06-14

Family

ID=81896814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011474676.6A Pending CN114625730A (en) 2020-12-14 2020-12-14 Data storage method and device

Country Status (1)

Country Link
CN (1) CN114625730A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination