CN111159192A - Data storage method and device based on big data, storage medium and processor - Google Patents

Data storage method and device based on big data, storage medium and processor

Info

Publication number
CN111159192A
Authority
CN
China
Prior art keywords
data
data storage
storage
database
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911399211.6A
Other languages
Chinese (zh)
Other versions
CN111159192B (en)
Inventor
张炎红
贠瑞峰
刘彬彬
彭翔
刘粉香
贺喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Internetware Ltd
Original Assignee
Smart Shenzhou Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Smart Shenzhou Beijing Technology Co Ltd
Priority to CN201911399211.6A
Publication of CN111159192A
Application granted
Publication of CN111159192B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data storage method, a device, a storage medium and a processor based on big data. The method comprises the following steps: establishing a basic storage database by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data and the distributed retrieval engine is used for storing index information of the original data; constructing a data table according to a data storage structure and a data storage format in the basic storage database; dynamically managing the data table by using a configuration library; and warehousing data based on the data table. Establishing the basic storage database from a distributed column storage database and a distributed retrieval engine provides fast data retrieval and high data throughput. The data table is constructed from a standardized data storage structure and data storage format and is then managed dynamically through the configuration library, so that data can be warehoused quickly.

Description

Data storage method and device based on big data, storage medium and processor
Technical Field
The application relates to the field of big data, in particular to a data storage method and device based on big data, a storage medium and a processor.
Background
In a big data environment, the existing distributed column storage database (HBase) and distributed data retrieval engine (Elasticsearch) place no restrictions on the structure of a data table. As a result, the data storage structure cannot be standardized, the data types that are automatically identified and generated are not accurate enough, and dynamically generated storage tables are largely unusable. In addition, when HBase and the distributed data retrieval engine work together, multiple copies of the same data exist in the search engine and in the database, which wastes storage resources.
The above information disclosed in this background section is only for enhancement of understanding of the background of the technology described herein and, therefore, certain information may be included in the background that does not form the prior art that is already known in this country to a person of ordinary skill in the art.
Disclosure of Invention
The application mainly aims to provide a data storage method, a data storage device, a storage medium and a processor based on big data, so as to solve the problem of low data storage efficiency in the prior art.
In order to achieve the above object, according to an aspect of the present application, there is provided a big data based data warehousing method, including: establishing a basic storage database by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data, and the distributed retrieval engine is used for storing index information of the original data; constructing a data table according to a data storage structure and a data storage format in the basic storage database; dynamically managing the data table by using a configuration library; and performing data warehousing based on the data table.
Further, constructing a data table according to the data storage structure and the data storage format in the basic storage database includes: adjusting the data storage format in the basic storage database to a predetermined data storage format; adjusting the data storage structure in the basic storage database to a predetermined data storage structure according to the predetermined data storage format; and constructing the data table according to the predetermined data storage structure and the predetermined data storage format.
Further, before adjusting the data storage structure in the basic storage database to the predetermined data storage structure according to the predetermined data storage format, constructing the data table further includes: automatically identifying the predetermined data storage format in the basic storage database.
Further, dynamically managing the data table by using a configuration library includes: dynamically managing the data table by means of the XML plug-in function of the configuration library.
Further, the index information is a primary key of the original data, the primary key is a unique identifier of the original data, and the primary key is composed of a generation date of the original data, a code value of the original data, and a hash value of the original data.
Further, the predetermined data storage format includes at least one of: time format, age format, name format.
Further, the data storage structure includes at least one of a first data storage structure and a second data storage structure, wherein the first data storage structure corresponds to person information and is composed of name information, age information and native place information, and the second data storage structure corresponds to company information and is composed of legal information, company location information and company annual income information.
According to another aspect of the present application, there is provided a big-data-based data warehousing apparatus, including: an establishing unit, configured to establish a basic storage database by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data and the distributed retrieval engine is used for storing index information of the original data; an optimization unit, configured to construct a data table according to a data storage structure and a data storage format in the basic storage database; a management unit, configured to dynamically manage the data table by using a configuration library; and a warehousing unit, configured to warehouse data based on the data table.
According to still another aspect of the present application, there is provided a storage medium including a stored program, wherein the program executes any one of the warehousing methods.
According to another aspect of the present application, there is provided a processor for executing a program, wherein the program executes any one of the warehousing methods.
According to this technical scheme, the basic storage database is established from a distributed column storage database and a distributed retrieval engine, which provides fast data retrieval and high data throughput. The data storage structure and the data storage format in the basic storage database are obtained by optimizing the original data storage structure and data storage format, so that they are more standardized. The data table is constructed from the standardized data storage structure and data storage format and is then managed dynamically through the configuration library, so that data can be warehoused quickly.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 shows a flow diagram of a big-data-based data warehousing method according to an embodiment of the present application; and
FIG. 2 shows a schematic diagram of a big-data-based data warehousing apparatus according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" another element, it can be directly on the other element or intervening elements may also be present. Also, in the specification and claims, when an element is described as being "connected" to another element, the element may be "directly connected" to the other element or "connected" to the other element through a third element.
For convenience of description, some terms or expressions referred to in the embodiments of the present application are explained below:
HBase: a distributed, column-oriented open source database. Unlike a general-purpose database, HBase is suited to storing unstructured data.
Elasticsearch: a distributed data retrieval engine. It is a Lucene-based search server that provides a distributed, multi-user full-text search engine, and it is a popular enterprise-level search engine.
Database primary key: a combination of one or more columns whose values uniquely identify each row in a table and through which the entity integrity of the table is enforced. The primary key is mainly used for association with the foreign keys of other tables and for modifying and deleting records.
Data table structure: a data table consists of three parts, namely the table name, the fields in the table and the records in the table. Designing the data table structure means defining the file name of the data table and determining which fields it contains, together with the name, type and width of each field; the data are then entered into the computer.
Configuration library: a store for configuration items that records all information related to them. It is a powerful tool for configuration management, and the information in the library can answer many configuration management questions.
According to the embodiment of the application, a data storage method based on big data is provided.
FIG. 1 is a flow diagram of a big-data-based data warehousing method according to an embodiment of the present application. As shown in FIG. 1, the method comprises the following steps:
Step S101: establishing a basic storage database by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data and the distributed retrieval engine is used for storing index information of the original data;
Step S102: constructing a data table according to the data storage structure and the data storage format in the basic storage database;
Step S103: dynamically managing the data table by using a configuration library;
Step S104: warehousing data based on the data table.
In this method, the basic storage database is established from a distributed column storage database and a distributed retrieval engine, which provides fast data retrieval and high data throughput. The data storage structure and the data storage format in the basic storage database are obtained by optimizing the original data storage structure and data storage format, so that they are more standardized. The data table is constructed from the standardized data storage structure and data storage format and is then managed dynamically through the configuration library, so that data can be warehoused quickly.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
It should be further noted that the distributed column storage database stores the original data, while the distributed retrieval engine stores only the index information of the original data. This avoids keeping multiple copies of the same data in both components when they work together, and therefore saves storage resources.
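The division of labour described above can be illustrated with a minimal in-memory sketch. It does not use HBase or Elasticsearch; the class `BaseStore` and its methods are hypothetical stand-ins, assuming that the retrieval side holds only row keys and that raw records live solely in the column-store side.

```python
from typing import Dict, List, Set, Tuple


class BaseStore:
    """Hypothetical in-memory stand-in for the basic storage database."""

    def __init__(self) -> None:
        # Stand-in for the distributed column storage database:
        # raw records addressed by row key.
        self.raw: Dict[str, dict] = {}
        # Stand-in for the distributed retrieval engine: it maps
        # (field, value) to row keys and never holds the raw record.
        self.index: Dict[Tuple[str, object], Set[str]] = {}

    def put(self, row_key: str, record: dict, indexed_fields: List[str]) -> None:
        """Store the raw record once and index only its row key."""
        self.raw[row_key] = record
        for field in indexed_fields:
            self.index.setdefault((field, record[field]), set()).add(row_key)

    def search(self, field: str, value: object) -> List[dict]:
        """Resolve row keys via the index, then read raw data from the column store."""
        keys = sorted(self.index.get((field, value), set()))
        return [self.raw[k] for k in keys]


store = BaseStore()
store.put("k1", {"name": "xiaoming", "age": 26}, indexed_fields=["name"])
print(store.search("name", "xiaoming"))
```

Because the index holds only keys, each record occupies space once; a query costs one index lookup plus one read from the raw store.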
It should be further noted that a data storage structure is the representation of the logical structure of data in a computer. Relationships between data elements can be represented in two different ways, sequential mapping and non-sequential mapping, which give rise to two different data storage structures: sequential storage structures and chained storage structures.
In an embodiment of the present application, constructing a data table according to the data storage structure and the data storage format in the basic storage database includes: adjusting the data storage format in the basic storage database to a predetermined data storage format; adjusting the data storage structure in the basic storage database to a predetermined data storage structure according to the predetermined data storage format; and constructing the data table according to the predetermined data storage structure and the predetermined data storage format. Specifically, a data table consists of three parts, namely the table name, the fields in the table and the records in the table. Constructing the data table means defining the file name of the data table and determining which fields it contains, together with the name, type and width of each field. Restricting the structure of the data table to the predetermined data storage structure and the predetermined data storage format reduces the complexity of data storage and avoids the problem that a complex access mode requires background knowledge and detailed functional parameters.
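As a rough illustration of a data table defined by a table name and by fields with a name, type and width, the following sketch declares such a structure and checks records against it. The names `FieldDef`, `TableDef`, `PERSON_TABLE` and the chosen widths are assumptions made for the example and do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldDef:
    name: str
    type: type
    width: int  # maximum length of the value's string form (assumed meaning)


@dataclass
class TableDef:
    table_name: str
    fields: List[FieldDef]

    def validate(self, record: dict) -> dict:
        """Return the record restricted to the declared fields, or raise ValueError."""
        clean = {}
        for f in self.fields:
            if f.name not in record:
                raise ValueError(f"missing field: {f.name}")
            value = record[f.name]
            if not isinstance(value, f.type):
                raise ValueError(f"{f.name}: expected {f.type.__name__}")
            if len(str(value)) > f.width:
                raise ValueError(f"{f.name}: wider than {f.width}")
            clean[f.name] = value
        return clean


# Hypothetical predetermined structure for person information.
PERSON_TABLE = TableDef("person", [
    FieldDef("name", str, 32),
    FieldDef("age", int, 3),
    FieldDef("native_place", str, 64),
])

print(PERSON_TABLE.validate(
    {"name": "xiaoming", "age": 26, "native_place": "Beijing", "extra": 1}
))
```

Fields that are not declared in the table definition are dropped, so every stored record has the same predetermined shape.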
In an embodiment of the application, before adjusting the data storage structure in the basic storage database to the predetermined data storage structure according to the predetermined data storage format, constructing the data table further includes: automatically identifying the predetermined data storage format in the basic storage database. Specifically, because the predetermined data storage format is identified automatically, the data storage structures in the basic storage database can subsequently be processed uniformly into a single format and data storage standard.
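The application does not say how the format is identified. One possible approach, sketched here purely as an assumption, is pattern matching on the value. The format labels (time, age, name) are the ones this description lists for the predetermined data storage format; the regular expressions and the function `identify_format` are hypothetical.

```python
import re
from typing import Optional

# Hypothetical patterns, checked in order; the description gives only
# one example value per format.
PATTERNS = {
    "time": re.compile(r"^\d{4}[.\-/]\d{1,2}[.\-/]\d{1,2}$"),
    "age": re.compile(r"^\d{1,3}( years old)?$"),
    "name": re.compile(r"^[^\W\d_]+( [^\W\d_]+)*$"),
}


def identify_format(value: str) -> Optional[str]:
    """Return the label of the first matching format, or None if nothing matches."""
    text = value.strip()
    for label, pattern in PATTERNS.items():
        if pattern.match(text):
            return label
    return None


for sample in ["2019.12.27", "26 years old", "xiaoming", "???"]:
    print(sample, "->", identify_format(sample))
```

A value that matches none of the patterns returns `None`, which a caller could treat as a signal to reject the record or route it for manual handling.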
In an embodiment of the present application, dynamically managing the data table by using a configuration library includes: dynamically managing the data table by means of the XML plug-in function of the configuration library. Specifically, the configuration library is managed through XML plug-in functions, which makes it highly extensible. The table structure is managed dynamically through the configuration library, the predetermined data storage format is supported, and the methods for data management and storage are encapsulated, so that the method is simple and convenient to use.
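The XML layout of the configuration library is not disclosed. The sketch below assumes a simple hypothetical layout (`<tables>`, `<table>`, `<field>` elements with `name`, `type` and `format` attributes) to show how table definitions could be loaded from XML, so that changing the configuration changes the managed tables without changing code.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document; element and attribute names are assumptions.
CONFIG_XML = """
<tables>
  <table name="person">
    <field name="name" type="string" format="name"/>
    <field name="age" type="int" format="age"/>
  </table>
</tables>
"""


def load_tables(xml_text: str) -> dict:
    """Parse table definitions into {table_name: [(field, type, format), ...]}."""
    root = ET.fromstring(xml_text.strip())
    tables = {}
    for table in root.findall("table"):
        tables[table.get("name")] = [
            (f.get("name"), f.get("type"), f.get("format"))
            for f in table.findall("field")
        ]
    return tables


print(load_tables(CONFIG_XML))
```

Reloading the configuration after a new `<table>` element is added makes the new table available to the warehousing code without redeployment.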
In an embodiment of the present application, the index information is a primary key of the original data, the primary key is a unique identifier of the original data, and the primary key is composed of a generation date of the original data, a code value of the original data, and a hash value of the original data.
Specifically, the unique primary key generation algorithm includes the following steps:
Step A: generate the MD5 value of the data.
The MD5 value of the data is treated as unique, that is, different data are taken to have different MD5 values. The MD5 value is a 32-character combination of letters and digits, for example: e10adc3949ba59abbe56e057f20f883e.
Step B: generate a hash value of the MD5 value of the data, and take the absolute value of the hash value.
The hash value is an effectively random integer, and the same data always produce the same hash value. For example, the absolute value of the hash value of e10adc3949ba59abbe56e057f20f883e is 60.
Step C: define an array consisting of the characters a-z followed by 0-9, giving a fixed array of 36 elements.
Step D: take the remainder of the hash value modulo 36 to obtain the subscript corresponding to the data. For example, 60 modulo 36 gives the subscript 24, and the corresponding character in the array is x (the 24th character when counting from 1).
Step E: determine the generation time of each piece of data. Every piece of data must have a generation time; if none is present, the current system time is used by default. For example, the time corresponding to the data is 2019-11-12 18:55:55.
Step F: form the unique primary key of the data by concatenating the character from the array, "#", the date, "#" and the MD5 value of the data. For example, the unique primary key (key) corresponding to the data is: x#20191112#e10adc3949ba59abbe56e057f20f883e.
With this key generation algorithm, the key corresponding to a piece of data can be recomputed from the data, and the data can be found from the key. Because the leading character of the key is derived from a hash value, the data are distributed evenly, while the date component allows the data to be retrieved quickly by time, which improves data throughput for time-range queries.
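Steps A to F can be sketched as follows. Two points are assumptions, because the description does not settle them: the hash function (Java's `String.hashCode` is used here as a plausible choice, reimplemented in Python) and the array indexing (0-based here). The helper names are hypothetical, and since the hash value 60 in the worked example is illustrative, this sketch is not expected to reproduce the prefix character x.

```python
import hashlib
from datetime import datetime
from typing import Optional

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # step C: 36 fixed characters


def java_string_hash(s: str) -> int:
    """Assumed hash function: Java's String.hashCode as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def make_row_key(data: bytes, generated_at: Optional[datetime] = None) -> str:
    md5_hex = hashlib.md5(data).hexdigest()        # step A
    bucket = abs(java_string_hash(md5_hex)) % 36   # steps B and D
    prefix = ALPHABET[bucket]                      # 0-based lookup (assumption)
    when = generated_at or datetime.now()          # step E: default to system time
    return f"{prefix}#{when:%Y%m%d}#{md5_hex}"     # step F


key = make_row_key(b"123456", datetime(2019, 11, 12, 18, 55, 55))
print(key)
```

The MD5 digest of the bytes `123456` is the digest used in the example above, so the date and digest parts of the printed key match the example. The function is deterministic, so the same data and generation time always give the same key.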
In an embodiment of the application, the predetermined data storage format includes at least one of: time format, age format, name format. Specifically, the time format may be "2019.12.27", the age format may be "26 years old", the name format may be "xiaoming", and the predetermined data storage format is not limited thereto, and a person skilled in the art may select an appropriate data storage format according to actual circumstances.
In an embodiment of the application, the data storage structure includes at least one of a first data storage structure and a second data storage structure, wherein the first data storage structure corresponds to person information and is composed of name information, age information and native place information, and the second data storage structure corresponds to company information and is composed of legal information, company location information and company annual income information. The data storage structure is not limited to these, and those skilled in the art may select an appropriate data storage structure according to the actual situation, so as to facilitate classified queries and improve retrieval efficiency; for example, the classified queries include person information queries and company information queries.
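The two structures can be written down as simple record types. The sketch below is only illustrative: the class names, field names and field types are assumptions, and `classify` is a hypothetical helper showing how grouping by structure supports separate person and company queries.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class Person:          # first data storage structure (person information)
    name: str
    age: int
    native_place: str


@dataclass
class Company:         # second data storage structure (company information)
    legal_info: str
    location: str
    annual_income: float


def classify(records: list) -> Dict[str, List[dict]]:
    """Group records by structure so that each category can be queried separately."""
    groups: Dict[str, List[dict]] = {"person": [], "company": []}
    for record in records:
        if isinstance(record, Person):
            groups["person"].append(asdict(record))
        elif isinstance(record, Company):
            groups["company"].append(asdict(record))
        else:
            raise TypeError(f"unsupported structure: {type(record).__name__}")
    return groups


groups = classify([
    Person("xiaoming", 26, "Beijing"),
    Company("example legal info", "Shenzhen", 1_000_000.0),
])
print(groups["person"])
```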
The embodiment of the present application further provides a data warehousing device based on big data, and it should be noted that the data warehousing device based on big data according to the embodiment of the present application may be used to execute the data warehousing method based on big data according to the embodiment of the present application. The following describes a data warehousing device based on big data provided by the embodiment of the present application.
Fig. 2 is a schematic diagram of a big-data-based data warehousing apparatus according to an embodiment of the present application. As shown in fig. 2, the apparatus includes:
an establishing unit 10, configured to establish a basic storage database by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data and the distributed retrieval engine is used for storing index information of the original data;
an optimization unit 20, configured to construct a data table according to the data storage structure and the data storage format in the basic storage database;
a management unit 30, configured to dynamically manage the data table by using a configuration library;
and a warehousing unit 40, configured to warehouse data based on the data table.
In the device, the establishing unit establishes the basic storage database from a distributed column storage database and a distributed retrieval engine, which provides fast data retrieval and high data throughput. The optimization unit obtains the data storage structure and the data storage format in the basic storage database by optimizing the original data storage structure and data storage format, so that they are more standardized, and constructs the data table from the standardized data storage structure and data storage format. The management unit dynamically manages the data table through the configuration library, and the warehousing unit warehouses the data quickly.
It should be noted that the distributed column storage database stores the original data, while the distributed retrieval engine stores only the index information of the original data. This avoids keeping multiple copies of the same data in both components when they work together, and therefore saves storage resources.
It should be further noted that a data storage structure is the representation of the logical structure of data in a computer. Relationships between data elements can be represented in two different ways, sequential mapping and non-sequential mapping, which give rise to two different data storage structures: sequential storage structures and chained storage structures.
In an embodiment of the present application, the optimization unit includes a first adjusting module, a second adjusting module and a construction module. The first adjusting module is configured to adjust the data storage format in the basic storage database to a predetermined data storage format; the second adjusting module is configured to adjust the data storage structure in the basic storage database to a predetermined data storage structure according to the predetermined data storage format; and the construction module is configured to construct the data table according to the predetermined data storage structure and the predetermined data storage format. Specifically, a data table consists of three parts, namely the table name, the fields in the table and the records in the table. Constructing the data table means defining the file name of the data table and determining which fields it contains, together with the name, type and width of each field. Restricting the structure of the data table to the predetermined data storage structure and the predetermined data storage format reduces the complexity of data storage and avoids the problem that a complex access mode requires background knowledge and detailed functional parameters.
In an embodiment of the application, the optimization unit further includes an identification module, configured to automatically identify the predetermined data storage format in the basic storage database before the data storage structure in the basic storage database is adjusted to the predetermined data storage structure according to the predetermined data storage format. Specifically, because the predetermined data storage format is identified automatically, the data storage structures in the basic storage database can subsequently be processed uniformly into a single format and data storage standard.
In an embodiment of the application, the management unit includes a pipeline module, configured to dynamically manage the data table by means of the XML plug-in function of the configuration library. Specifically, the configuration library is managed through XML plug-in functions, which makes it highly extensible. The table structure is managed dynamically through the configuration library, the predetermined data storage format is supported, and the methods for data management and storage are encapsulated, so that the method is simple and convenient to use.
In an embodiment of the present application, the index information is a primary key of the original data, the primary key is a unique identifier of the original data, and the primary key is composed of a generation date of the original data, a code value of the original data, and a hash value of the original data.
Specifically, the unique primary key generation algorithm includes the following steps:
step A: generating an MD5 value for the data;
the MD5 value of the data is unique, different data have different MD5 values, and the MD5 value is a 32-character combination of letters and digits, for example: e10adc3949ba59abbe56e057f20f883e;
step B: generating a hash value of the MD5 value of the data, and taking the absolute value of the hash value;
the hash value is an integer that appears random, and the same data always produces the same hash value. For example, the absolute value of the hash value of e10adc3949ba59abbe56e057f20f883e is 60;
step C: defining an array consisting of a-z plus 0-9, which forms a fixed array of 36 elements;
step D: taking the generated hash value modulo 36 to obtain the subscript of the data corresponding to the hash value, for example, 60 modulo 36 gives the subscript 24, and the corresponding character in the array is x;
step E: determining the generation time of each piece of data, wherein each piece of data must have a generation time, and if it does not, the current system time is used by default, for example, the time corresponding to the data is 2019-11-12 18:55:55;
step F: determining the unique primary key corresponding to the data according to the rule "character in the array + # + date + # + MD5 value of the data", for example, the unique primary key (key) corresponding to the data is: x#20191112#e10adc3949ba59abbe56e057f20f883e.
With this key generation algorithm, the key corresponding to a piece of data can be recomputed from the data itself, and the data can in turn be found by its key. Because the leading character of the key is derived from a hash value and is effectively random, the data are distributed uniformly, while the date component allows the data to be retrieved rapidly by time, which improves data throughput for queries over a time range.
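Steps A to F can be sketched as follows. This is a minimal sketch under stated assumptions: the application does not name the hash function used in step B, so a deterministic 32-bit string hash in the style of Java's `String.hashCode` is assumed here, and the array lookup is zero-based, which may differ from the counting convention behind the example in step D. The function names are hypothetical.

```python
import hashlib
from datetime import datetime

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # step C: a-z plus 0-9


def java_style_hash(text):
    """Deterministic 32-bit string hash (an assumption; the source names no hash)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the unsigned value as a signed 32-bit integer.
    return h - (1 << 32) if h >= (1 << 31) else h


def make_primary_key(data, generated_at=None):
    """Build '<char>#<yyyymmdd>#<md5>' for a bytes payload."""
    md5_hex = hashlib.md5(data).hexdigest()        # step A
    bucket = abs(java_style_hash(md5_hex)) % 36    # steps B and D
    prefix = ALPHABET[bucket]                      # zero-based lookup (assumption)
    when = generated_at or datetime.now()          # step E: default to current time
    return f"{prefix}#{when:%Y%m%d}#{md5_hex}"     # step F


def prefixes_for_date(day):
    """All 36 key prefixes that may hold data generated on the given day."""
    return [f"{c}#{day:%Y%m%d}#" for c in ALPHABET]
```

Because every part of the key is derived from the data and its generation time, calling `make_primary_key` again on the same input yields the same key. A query for one day would scan the 36 prefixes returned by `prefixes_for_date`, which is one way to combine uniform distribution with retrieval by time.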
In an embodiment of the application, the predetermined data storage format includes at least one of: time format, age format, name format. Specifically, the time format may be "2019.12.27", the age format may be "26 years old", and the name format may be "xiaoming". The predetermined data storage format is not limited thereto, and a person skilled in the art may select an appropriate data storage format according to actual circumstances.
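One possible way to recognize these formats automatically is pattern matching. The sketch below is an assumption for illustration: the patterns are modelled only on the three example values given above, and the `identify_format` helper is hypothetical.

```python
import re

# Hypothetical patterns modelled on the three example values in the text.
FORMAT_PATTERNS = {
    "time": re.compile(r"\d{4}\.\d{2}\.\d{2}"),
    "age": re.compile(r"\d{1,3} years old"),
    "name": re.compile(r"[A-Za-z]+"),
}


def identify_format(value):
    """Return the first predetermined format the value fully matches, else None."""
    for label, pattern in FORMAT_PATTERNS.items():
        if pattern.fullmatch(value):
            return label
    return None
```

A value that matches none of the patterns returns `None`, so a caller can decide whether to reject it or convert it before warehousing.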
In an embodiment of the application, the data storage structure includes at least one of: a first data storage structure and a second data storage structure, wherein the first data storage structure corresponds to information about people and consists of name information, age information and native place information, and the second data storage structure corresponds to information about companies and consists of legal information, company location information and company annual income information. Specifically, the data storage structure is not limited thereto, and those skilled in the art may select an appropriate data storage structure according to actual circumstances, so as to facilitate classified queries and improve retrieval efficiency. For example, the classified queries include a person information query and a company information query.
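The two data storage structures could be modelled as typed records, for example as below. The class names, field names, field types, and the `table_for` routing helper are hypothetical; only the grouping of fields follows the description above.

```python
from dataclasses import dataclass


@dataclass
class PersonRecord:
    """First data storage structure: information about a person."""
    name: str
    age: int
    native_place: str


@dataclass
class CompanyRecord:
    """Second data storage structure: information about a company."""
    legal_info: str
    location: str
    annual_income: float


def table_for(record):
    """Route a record to a table name by its structure (names are hypothetical)."""
    if isinstance(record, PersonRecord):
        return "person"
    if isinstance(record, CompanyRecord):
        return "company"
    raise TypeError(f"unsupported record type: {type(record).__name__}")
```

Keeping the two structures in separate tables is what allows a person information query and a company information query to be served independently.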
The data warehousing device based on big data comprises a processor and a memory. The establishing unit, the optimizing unit, the management unit, the warehousing unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the problem of low data storage efficiency in the prior art is solved by adjusting kernel parameters.
The memory may include computer-readable media in the form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium, on which a program is stored, and when the program is executed by a processor, the method for data storage based on big data is implemented.
The embodiment of the invention provides a processor, which is used for running a program, wherein the data storage method based on big data is executed when the program runs.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein when the processor executes the program, at least the following steps are realized:
step S101, a basic storage database is established by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data, and the distributed retrieval engine is used for storing index information of the original data;
step S102, constructing a data table according to the data storage structure and the data storage format in the basic storage database;
step S103, dynamically managing the data table by using a configuration library mode;
and step S104, storing data based on the data table.
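The flow of steps S101 to S104 can be illustrated with in-memory stand-ins. This is a sketch under stated assumptions, not the implementation of the application: plain dictionaries replace the distributed column storage database and the distributed retrieval engine, and the class and method names are hypothetical.

```python
class BasicStorage:
    """In-memory stand-in for the column store plus the retrieval engine (S101)."""

    def __init__(self):
        self.column_store = {}  # table -> {primary_key: row}; holds original data
        self.index = {}         # table -> list of primary keys; holds index information
        self.schemas = {}       # table -> list of column names

    def build_table(self, name, columns):
        """S102: construct a data table from a structure (a list of columns)."""
        self.schemas[name] = list(columns)
        self.column_store.setdefault(name, {})
        self.index.setdefault(name, [])

    def apply_config(self, config):
        """S103: create or update tables from a configuration mapping."""
        for name, columns in config.items():
            self.build_table(name, columns)

    def store(self, table, key, row):
        """S104: write the row and record its key as index information."""
        missing = [c for c in self.schemas[table] if c not in row]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        self.column_store[table][key] = row
        self.index[table].append(key)
```

For example, after `apply_config({"person": ["name", "age", "native_place"]})`, a call to `store("person", key, row)` places the row in the column store and its key in the index, mirroring the division of roles stated in step S101.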
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with at least the following method steps:
step S101, a basic storage database is established by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data, and the distributed retrieval engine is used for storing index information of the original data;
step S102, constructing a data table according to the data storage structure and the data storage format in the basic storage database;
step S103, dynamically managing the data table by using a configuration library mode;
and step S104, storing data based on the data table.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer-readable media in the form of volatile memory, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
From the above description, it can be seen that the above-described embodiments of the present application achieve the following technical effects:
1) In the method of the present application, a basic storage database is established by adopting a distributed column storage database and a distributed retrieval engine, so that rapid retrieval of data and high data throughput are realized. The data storage structure and the data storage format in the basic storage database are obtained by optimizing the original data storage structure and the original data storage format, so the optimized data storage structure and data storage format are more standardized. A data table is constructed by using the standardized data storage structure and data storage format, and the data table is then dynamically managed by means of a configuration library, so that rapid warehousing of the data is realized.
2) In the device of the present application, the establishing unit establishes the basic storage database by adopting the distributed column storage database and the distributed retrieval engine, so that rapid retrieval of data and high data throughput are realized. The optimizing unit obtains the data storage structure and the data storage format in the basic storage database by optimizing the original data storage structure and the original data storage format, so the optimized data storage structure and data storage format are more standardized, and the data table is constructed by using the standardized data storage structure and data storage format. The management unit dynamically manages the data table by means of a configuration library, and the warehousing unit realizes rapid warehousing of the data.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A data warehousing method based on big data, characterized by comprising the following steps:
establishing a basic storage database by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data, and the distributed retrieval engine is used for storing index information of the original data;
constructing a data table according to a data storage structure and a data storage format in the basic storage database;
dynamically managing the data table by using a configuration library;
and performing data warehousing based on the data table.
2. The warehousing method of claim 1, wherein constructing a data table according to the data storage structure and the data storage format in the basic storage database comprises:
adjusting the data storage format in the basic storage database to a predetermined data storage format;
adjusting the data storage structure in the basic storage database to a predetermined data storage structure according to the predetermined data storage format;
and constructing the data table according to the predetermined data storage structure and the predetermined data storage format.
3. The warehousing method according to claim 2, wherein constructing a data table according to the data storage structure and the data storage format in the basic storage database further comprises, before the data storage structure in the basic storage database is adjusted to the predetermined data storage structure according to the predetermined data storage format:
automatically identifying the predetermined data storage format in the basic storage database.
4. The warehousing method of claim 1, wherein dynamically managing the data tables using a configuration library comprises:
and dynamically managing the data table by adopting the plug-in function of the xml of the configuration library.
5. The warehousing method according to claim 1, wherein the index information is a primary key of the original data, the primary key being a unique identifier of the original data, the primary key being composed of a generation date of the original data, a code value of the original data, and a hash value of the original data.
6. A warehousing method according to claim 2, characterized in that the predetermined data storage format comprises at least one of:
time format, age format, name format.
7. The warehousing method of any of claims 1-6, wherein the data storage structure comprises at least one of: a first data storage structure and a second data storage structure, wherein the first data storage structure corresponds to information about people and is composed of name information, age information and native place information, and the second data storage structure corresponds to information about companies and is composed of legal information, company location information and company annual income information.
8. A big data-based data warehousing device, comprising:
an establishing unit, configured to establish a basic storage database by adopting a distributed column storage database and a distributed retrieval engine, wherein the distributed column storage database is used for storing original data, and the distributed retrieval engine is used for storing index information of the original data;
an optimization unit, configured to construct a data table according to a data storage structure and a data storage format in the basic storage database;
a management unit, configured to dynamically manage the data table by using a configuration library;
and a warehousing unit, configured to perform data warehousing based on the data table.
9. A storage medium characterized by comprising a stored program, wherein the program executes the warehousing method of any one of claims 1 to 7.
10. A processor, configured to run a program, wherein the program, when run, performs the warehousing method according to any of claims 1 to 7.
CN201911399211.6A 2019-12-30 2019-12-30 Big data based data warehousing method and device, storage medium and processor Active CN111159192B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911399211.6A CN111159192B (en) 2019-12-30 2019-12-30 Big data based data warehousing method and device, storage medium and processor

Publications (2)

Publication Number Publication Date
CN111159192A true CN111159192A (en) 2020-05-15
CN111159192B CN111159192B (en) 2023-09-05

Family

ID=70559616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911399211.6A Active CN111159192B (en) 2019-12-30 2019-12-30 Big data based data warehousing method and device, storage medium and processor

Country Status (1)

Country Link
CN (1) CN111159192B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050091188A1 (en) * 2003-10-24 2005-04-28 Microsoft Indexing XML datatype content system and method
US20060036631A1 (en) * 2004-08-10 2006-02-16 Palo Alto Research Center Incorporated High performance XML storage retrieval system and method
US20080091698A1 (en) * 2006-10-17 2008-04-17 International Business Machines Corporation Optimal data storage and access for clustered data in a relational database
WO2011130706A2 (en) * 2010-04-16 2011-10-20 Salesforce.Com, Inc. Methods and systems for performing cross store joins in a multi-tenant store
WO2014123529A1 (en) * 2013-02-07 2014-08-14 Hewlett-Packard Development Company, L.P. Formatting semi-structured data in a database
US20160239527A1 (en) * 2015-02-16 2016-08-18 Naver Corporation Systems, apparatuses, methods, and computer readable media for processing and analyzing big data using columnar index data format
EP3128445A1 (en) * 2015-08-05 2017-02-08 Sap Se Data archive vault in big data platform
US20170235845A1 (en) * 2015-12-29 2017-08-17 Teradata Us, Inc. Non-unique secondary indexing of semi-structured data in databases
CN110276002A (en) * 2019-06-26 2019-09-24 浙江大搜车软件技术有限公司 Search for application data processing method, device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
蒋维; 郝文宁; 杨晓恝; 靳大尉: "分布式数据库搜索引擎的索引建立和优化" (Index establishment and optimization for a distributed database search engine) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076311A (en) * 2020-01-03 2021-07-06 上海亲平信息科技股份有限公司 Distributed database
CN113076311B (en) * 2020-01-03 2023-04-11 上海亲平信息科技股份有限公司 Distributed database
CN114356851A (en) * 2022-01-12 2022-04-15 北京字节跳动网络技术有限公司 Data file storage method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111159192B (en) 2023-09-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20200803
Address after: 1608, 14/F, No. 65, Beisihuan West Road, Haidian District, Beijing 100080
Applicant after: BEIJING INTERNETWARE Ltd.
Address before: No. 603, floor 6, No. 9, Shangdi 9th Street, Haidian District, Beijing 100085
Applicant before: Smart Shenzhou (Beijing) Technology Co.,Ltd.
GR01 Patent grant