WO2014015488A1

WO2014015488A1 - Method and apparatus for data storage and query

Info

Publication number: WO2014015488A1
Application number: PCT/CN2012/079155
Authority: WO
Inventors: 韩建中
Original assignee: 华为技术有限公司
Priority date: 2012-07-25
Filing date: 2012-07-25
Publication date: 2014-01-30
Also published as: CN102906751B; CN102906751A

Abstract

Disclosed are a method and apparatus for data storage and query, which relate to the technical field of communication networks and improve the speed of data storage and query. In the solution provided by the embodiments of the present invention, the cloud storage device obtains the data to be stored, distributes the data to be stored uniformly to each cloud storage data node, and stores the data to be stored in the cloud storage distributed database in parallel. In addition, the cloud storage device obtains the index fields entered by the user and generates query commands according to the index fields; sends the query commands to each cloud storage data node and queries data in the cloud storage distributed database in parallel; and sends the set of query results of the cloud storage data nodes to the user. The solution provided by the embodiments of the present invention is suitable for storing and querying data.

Description

Method and device for data storage and data query

The present invention relates to the field of communication network technologies, and in particular, to a data storage and data query method and apparatus.

Background

Cloud computing is the product of the development of distributed processing, parallel processing, and grid computing. Cloud storage is an extension and development of cloud computing. It refers to a system in which a large number of storage devices in a network work together through cluster applications, grid technologies, distributed file systems, distributed databases, and the like, to jointly provide data storage and service access functions externally.

Currently, relational databases store data in rows and columns. Take a CDR table in an Oracle database as an example. Generally, each CDR record exists as one row in the database table, and each row contains the number, the other party's number, the duration of the call, and the like. The data is stored in data blocks (Oracle data blocks). The data block is the smallest storage unit of Oracle and occupies a certain amount of disk space (for example, a 16 KB block); that is, every Oracle I/O (input/output) operation is performed in units of blocks. For example, although a record is only 100 bytes long, at least one data block is read when it is queried. If the record spans two data blocks, two blocks must be read.
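
The cost of block-granular I/O described above can be illustrated with a little arithmetic. The following is a minimal sketch and not Oracle's actual implementation; the function name `blocks_read` and the byte offsets are assumptions introduced only for illustration.

```python
def blocks_read(record_offset: int, record_size: int, block_size: int = 16 * 1024) -> int:
    """Number of whole blocks that must be read to fetch one record.

    record_offset is a hypothetical byte position of the record in the data file.
    """
    first_block = record_offset // block_size
    last_block = (record_offset + record_size - 1) // block_size
    return last_block - first_block + 1


# A 100-byte record that lies inside one 16 KB block costs one block read.
inside = blocks_read(record_offset=0, record_size=100)
# The same record straddling a block boundary costs two block reads.
straddling = blocks_read(record_offset=16 * 1024 - 50, record_size=100)
```

In both cases far more bytes are read than the 100 bytes actually requested, which is the overhead the background section points to.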

The file system can also be used for data storage and query. For example, detailed-list and billing data are stored as files in the file system. The file system can classify data by region, time (such as billing period), number, and so on, and store structured records directly in files as text or in another form. Usually, the file system uses a time-based directory structure; for example, directories are created according to time (billing period) and user number segment, and one record file is created per number. When data needs to be queried, a simple index is formed from the directory hierarchy, file names, and so on. Querying data requires retrieving the massive metadata of the file system, reading all the stored files, decompressing them, and performing data retrieval at the application layer. However, when the prior art is used for mass data storage and data query, storage and query are slow.

Summary of the invention

Embodiments of the present invention provide a data storage and data query method and apparatus, which can improve the speed of storing and retrieving data.

Embodiments of the present invention adopt the following technical solutions:

A method of data storage, including:

The cloud storage device obtains data to be saved;

The cloud storage device distributes the data to be saved to each cloud storage data node, and stores the data to be saved in parallel to the distributed database of the cloud storage.

A method of data query, including:

The cloud storage device obtains an index field input by the user, and generates a query instruction according to the index field; the cloud storage device sends the query instruction to each cloud storage data node, and queries the data in a distributed database of the cloud storage in parallel;

The cloud storage device sends the set of query results of the cloud storage nodes to the user.

A device for data storage, comprising:

An obtaining module, configured to obtain data to be saved;

And a storage module, configured to distribute the data to be saved to each cloud storage data node, and store the data to be saved in a distributed database of the cloud storage.

A device for data query, comprising:

An obtaining module, configured to obtain an index field input by a user, and generate a query instruction according to the index field;

a processing module, configured to send the query instruction to a data node of each cloud storage, query data in a distributed database of the cloud storage in parallel, and send the set of query results of the cloud storage nodes to the user.

A data storage system, including a terminal and a cloud storage device.

The terminal is configured to extract data from the data source according to the configured data extraction rule to obtain the first data, and to save the first data in a temporary folder, so that the cloud storage device, according to the data uploading rule and the obtained transit area path, uploads the first data in the temporary folder to a corresponding directory in a temporary file transfer area of the cloud storage device;

The cloud storage device is configured to upload the first data in the temporary folder of the terminal to the corresponding directory in the temporary file transfer area of the cloud storage according to the configured data uploading rule and the transit area path, evenly distribute the first data in the corresponding directory of the temporary file transfer area to each cloud storage data node, and store the data to be saved in parallel in the distributed database of the cloud storage.

The embodiments of the present invention provide a method and a device for data storage and data query. A cloud storage device acquires data to be saved, uniformly distributes the data to be saved to each cloud storage data node, and stores the data to be saved in a distributed database of the cloud storage in parallel. In addition, the cloud storage device obtains an index field input by the user and generates a query instruction according to the index field; the cloud storage device sends the query instruction to each cloud storage data node and queries the data in the distributed database of the cloud storage in parallel; and the cloud storage device sends the set of query results of the cloud storage nodes to the user. In the prior art, when a relational database is used to access data, access is performed in units of blocks, which slows down storage and query; when a file system is used to access data, the operations are pure file operations, queries cannot be performed according to specified conditions, management is difficult, and entire files need to be read and decompressed, which slows down retrieval. By contrast, the solution provided by the embodiments of the present invention provides parallel storage and parallel data query, and can therefore increase the speed at which data is stored and retrieved.

Brief description of the drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without any inventive labor.

FIG. 1 is a flowchart of a method for data storage according to Embodiment 1 of the present invention;

FIG. 2 is a flowchart of a method for data query according to Embodiment 1 of the present invention;

FIG. 3 is a block diagram of a device for data storage according to Embodiment 1 of the present invention;

FIG. 4 is a block diagram of an apparatus for data query according to Embodiment 1 of the present invention;

FIG. 5A is a flowchart of a data storage and data query method according to Embodiment 2 of the present invention;

FIG. 5B is a schematic diagram of a data storage and data query method according to Embodiment 2 of the present invention;

FIG. 6 is a block diagram of an apparatus for data storage provided in Embodiment 2 of the present invention;

FIG. 7 is a block diagram of an apparatus for data query provided by Embodiment 2 of the present invention;

FIG. 8 is a schematic diagram of a system for data storage according to Embodiment 2 of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are clearly and completely described in the following with reference to the accompanying drawings in the embodiments of the present invention. It is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without creative efforts are within the scope of the present invention.

Example 1

An embodiment of the present invention provides a data storage method. As shown in FIG. 1, the method includes:

Step 101: A cloud storage device acquires data to be saved.

Before this step, the method further includes: initially configuring the cloud storage rule, including defining the rule relating directories to sub-service types, defining the import rule of the files corresponding to each sub-service type, and defining the life cycle of the data, where the life cycle refers to the storage strategy defined for each type of data according to time; initially configuring the data extraction rule, including the data source from which data is extracted, the number of extraction processes, and the data range corresponding to each extraction process; and initially configuring the data uploading rule, including the number of uploading processes and the data range corresponding to each uploading process.

Optionally, extracting data in the external data source according to the configured data extraction rule to obtain the first data, or performing format conversion on the data in the external data source to obtain the second data;

Obtaining a transit path of the cloud storage according to the management node of the cloud storage;

According to the configured data uploading rule and the transit area path, the first data or the second data is saved in the corresponding directory of the temporary file transfer area of the cloud storage.

Step 102: The cloud storage device distributes the data to be saved to each cloud storage data node, and stores the data to be saved in parallel in a distributed database of the cloud storage.

Optionally, the cloud storage device uniformly distributes the to-be-saved data to each cloud storage data node according to a hash algorithm;

The cloud storage device simultaneously stores the different data to be saved on the cloud storage data nodes in a distributed database of the cloud storage, or simultaneously stores the split segments of the same data to be saved on the cloud storage data nodes in a distributed database of the cloud storage.
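
The hash-based distribution described above can be sketched as follows. The text does not name a specific hash algorithm, so MD5 from Python's standard library is an assumption here, as are the function name `node_for_key` and the sample workload of consecutive mobile numbers.

```python
import hashlib
from collections import Counter


def node_for_key(key: str, node_count: int) -> int:
    """Map a distribution key (for example, a mobile number) to a data node index."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


# Hypothetical workload: 10,000 consecutive mobile numbers over 4 data nodes.
numbers = [str(13606400000 + i) for i in range(10_000)]
load = Counter(node_for_key(n, 4) for n in numbers)
```

Because the node index depends only on the key, all records of one mobile number land on the same node, while different numbers spread roughly evenly across nodes.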

Optionally, when the cloud storage device uniformly distributes the data to be saved to each cloud storage data node and stores the data to be saved in a distributed database of the cloud storage in parallel, the degree of parallelism of the parallel storage increases automatically as the number of cloud storage data nodes increases.

Optionally, before the cloud storage device uniformly distributes the data to be saved to each cloud storage data node and stores the data to be saved in the distributed database of the cloud storage in parallel, the method further includes:

Determining a sub-service type of the to-be-saved data according to a corresponding directory of the temporary file transfer area, and a rule of the directory and the sub-service type in the configured cloud storage rule;

The data to be saved is then evenly distributed to each cloud storage data node according to the import rule of the files corresponding to the sub-service type in the cloud storage rule, and the data to be saved is stored in parallel in the database of the cloud storage.

Further, the data saved in the distributed database of the cloud storage is processed differently in different periods according to the data life cycle rule configured in the cloud storage rule.

In addition, data for different uses stored in the distributed database is saved in different numbers of copies according to the configured cloud storage rule, where the data for different uses includes production data and backup data, and the production data is used for querying.

The embodiment of the invention provides a data storage method. By distributing the data to be saved to each cloud storage data node and storing the data to be saved in parallel in the distributed database of the cloud storage, the data records are stored in the database in a distributed manner, so that data can be saved quickly.

An embodiment of the present invention provides a data query method. As shown in FIG. 2, the method includes:

Step 201: A cloud storage device acquires an index field input by a user, and generates a query instruction according to the index field.

For example, the user can input the mobile phone number and the month of the detailed list to be queried, and can generate a query command according to the mobile phone number and the month to perform subsequent query operations.
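
As an illustration of turning index fields into a query instruction, the sketch below builds a small instruction object from a mobile number and a month. The class `QueryInstruction`, the function `build_query`, the table name `gsm_cdr`, and the `YYYYMM` month format are all assumptions made for this example; the text does not specify the form of the instruction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryInstruction:
    table: str        # hypothetical table name for the sub-service type
    conditions: dict  # index field name -> value entered by the user


def build_query(mobile_number: str, month: str, table: str = "gsm_cdr") -> QueryInstruction:
    """Turn user-entered index fields into a query instruction."""
    if not mobile_number.isdigit():
        raise ValueError("mobile number must contain digits only")
    if len(month) != 6 or not month.isdigit():
        raise ValueError("month must look like YYYYMM")
    return QueryInstruction(
        table=table,
        conditions={"mobile_number": mobile_number, "month": month},
    )


instruction = build_query("13606401754", "201112")
```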

Further, the index field of the user input is received through the query interface.

Step 202: The cloud storage device sends the query instruction to each cloud storage data node, and queries the data in a distributed database of the cloud storage in parallel;

The saved data can be divided into production data and backup data. Only the production data is queried; when the production data is damaged, the backup data can be used to recover it. When querying data, the user does not need to know where the data is stored or whether it is compressed.

The cloud storage device sends the query instruction to every cloud storage data node at the same time; the data that satisfies the query instruction is then queried simultaneously in the distributed database carried on each cloud storage data node.
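
The fan-out of one query instruction to all data nodes can be sketched with a thread pool. This is only an illustration under stated assumptions: the `NODES` dictionary is a hypothetical in-memory stand-in for the per-node databases, and `query_node` and `parallel_query` are names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the databases on three data nodes.
NODES = {
    "node-0": [{"mobile": "13606400001", "duration": 51}],
    "node-1": [{"mobile": "13606401754", "duration": 51},
               {"mobile": "13606401754", "duration": 65}],
    "node-2": [],
}


def query_node(node: str, mobile: str) -> list:
    """Run the query instruction against one node's local data."""
    return [row for row in NODES[node] if row["mobile"] == mobile]


def parallel_query(mobile: str) -> list:
    """Send the same instruction to every node at once and collect the results."""
    with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
        futures = [pool.submit(query_node, node, mobile) for node in NODES]
        # Results are gathered in node order; each future holds one node's hits.
        return [row for future in futures for row in future.result()]


results = parallel_query("13606401754")
```

Using one worker per node mirrors the idea that the degree of parallelism grows with the number of data nodes.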

Step 203: The cloud storage device sends the set of query results of the cloud storage nodes to the user.

Optionally, the cloud storage device sorts the query results of the cloud storage nodes according to a user-defined rule, and sends the sorted query result set to the user; or

The cloud storage device sorts the query results of the cloud storage nodes in the order of the nodes, and sends the sorted query result sets to the user; or

The cloud storage device sorts the query results of the cloud storage nodes according to keywords in the query results, and sends the sorted query result set to the user.
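
The three ways of ordering the result set listed above can be sketched as one merge function. The names `merge_results` and `node_results`, the `mode` values, and the interpretation of "node order" as sorting by node name are assumptions made for this example.

```python
from typing import Callable, Optional

# Hypothetical per-node results: node name -> rows returned by that node.
node_results = {
    "node-1": [{"mobile": "13606401754", "start": "2011-12-31 09:35:10"}],
    "node-0": [{"mobile": "13606401754", "start": "2011-12-31 09:30:00"}],
}


def merge_results(per_node: dict, mode: str = "node",
                  key: Optional[Callable] = None,
                  keyword: Optional[str] = None) -> list:
    """Merge per-node results using one of the three orderings in the text."""
    if mode == "node":      # concatenate in node order (assumed: by node name)
        return [row for node in sorted(per_node) for row in per_node[node]]
    rows = [row for result in per_node.values() for row in result]
    if mode == "custom":    # user-defined rule supplied as a key function
        return sorted(rows, key=key)
    if mode == "keyword":   # sort by a field taken from the results
        return sorted(rows, key=lambda row: row[keyword])
    raise ValueError(f"unknown mode: {mode}")


by_node = merge_results(node_results)
by_start = merge_results(node_results, mode="keyword", keyword="start")
```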

The embodiment of the invention provides a data query method. According to the query instruction, each cloud storage data node queries data in a distributed database of the cloud storage in parallel, so that the query performance can be greatly improved.

The embodiment of the present invention provides a data storage device, which may be a cloud storage device. As shown in FIG. 3, the device includes an obtaining module 301 and a storage module 302.

The obtaining module 301 is configured to obtain data to be saved;

Further, the data obtaining unit in the obtaining module 301 is configured to extract data from an external data source according to the configured data extraction rule to obtain first data, or to obtain second data after format conversion of the data in the external data source;

The data uploading unit in the obtaining module 301 is configured to acquire a transit area path of the cloud storage from the management node of the cloud storage, and to save the first data or the second data to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data uploading rule and the transit area path.

Further, the device further includes an initial configuration module, configured to initially configure the cloud storage rule, including defining the rule relating directories to sub-service types, defining the import rule of the files corresponding to each sub-service type, and defining the life cycle of the data, where the life cycle refers to the storage strategy defined for each type of data according to time; to initially configure the data extraction rule, including the data source from which data is extracted, the number of extraction processes, and the data range corresponding to each extraction process; and to initially configure the data uploading rule, including the number of uploading processes and the data range corresponding to each uploading process.

The storage module 302 is configured to evenly distribute the data to be saved to each cloud storage data node, and store the data to be saved in a distributed database of the cloud storage in parallel.

The distribution unit in the storage module 302 is configured to uniformly distribute the to-be-saved data to each cloud storage data node according to a hash algorithm;

The storage unit in the storage module 302 is configured to simultaneously store the different data to be saved on the cloud storage data nodes in a distributed database of the cloud storage, or to simultaneously store the split segments of the same data to be saved on the cloud storage data nodes in a distributed database of the cloud storage.

When the data to be saved is evenly distributed to each cloud storage data node and stored in a distributed database of the cloud storage in parallel, the degree of parallelism of the parallel storage increases automatically as the number of cloud storage data nodes increases.

Further, the device further includes a determining module, configured to determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the rule relating directories to sub-service types in the configured cloud storage rule;

The storage module 302 is configured to distribute the data to be saved to each cloud storage data node according to the import rule of the files corresponding to the sub-service type in the cloud storage rule, and to store the data to be saved in parallel in a distributed database of the cloud storage.

Further, the device further includes a management module, configured to process the data saved in the distributed database of the cloud storage differently in different periods according to the data life cycle rule configured in the cloud storage rule.

The storage module is further configured to:

Save the data for different uses stored in the distributed database in different numbers of copies according to the configured cloud storage rule, where the data for different uses includes production data and backup data, and the production data is used for querying.

An embodiment of the present invention provides a device for storing data. The obtaining module obtains the data to be saved, and the storage module uniformly distributes the data to be saved to each cloud storage data node and stores it in parallel in the distributed database of the cloud storage. Storing the data records in the database in a distributed manner makes it possible to save data quickly.

The embodiment of the present invention provides a device for querying data, and the device may be a cloud storage device, as shown in FIG. 4, the device includes: an obtaining module 401, a processing module 402;

The obtaining module 401 is configured to obtain an index field input by the user, and generate a query instruction according to the index field;

The processing module 402 is configured to send the query instruction to each cloud storage data node, query data in a distributed database of the cloud storage in parallel, and send the set of the query results of the cloud storage nodes to the User.

The sending unit in the processing module 402 is configured to send the query instruction to the data nodes of each cloud storage at the same time;

The processing unit in the processing module 402 is configured to simultaneously query data that meets the query instruction in a distributed database carried on the data nodes of each cloud storage.

Optionally, the processing module 402 is configured to:

Sorting the query results of the cloud storage nodes according to a user-defined rule, and sending the sorted query result set to the user; or

Sorting the query results of the cloud storage nodes in node order, and sending the sorted query result set to the user; or

Sorting the query results of the cloud storage nodes according to keywords in the query results, and sending the sorted query result set to the user.

The embodiment of the invention provides a device for querying data. Using the query instruction generated by the obtaining module, the processing module queries the data in the database in parallel, so that the query performance can be greatly improved.

Example 2

The embodiment of the invention provides a data storage and data query method. As shown in FIG. 5, the method includes:

Step 501: The cloud storage device performs initial configuration on the cloud storage rule.

Optionally, the cloud storage device receives an initial configuration of the cloud storage rule by the administrator, including defining a basic rule of the storage, and defining a data life cycle.

Further, 1) defining a service type using cloud storage;

Specifically, (1) define a service name; for example, a detailed-list service, a billing service, an electronic document service, and the like.

(2) Define the number of copies saved for the cloud storage service; for example, require that at least 2 copies of the data be saved, or set the detailed-list service to save 3 copies.

(3) Set the purpose of each copy of the data; for example, a copy can be production data or backup data, where production data is used for data queries, and backup data is used to recover production data and is not used for data queries. For example, if 3 copies of the detailed-list service data are saved, with the first and second copies set as production data and the third copy as backup data, then normally only the first and second copies serve data access. If the first or second copy is damaged, it can be recovered from the third copy. Alternatively, all three copies can be set as production data, and the cloud storage scheduler distributes requests evenly among the three copies.

(4) Define the data life cycle of the service type, where the life cycle refers to the storage strategy defined for each type of data according to time.

Storage policies can include uncompressed storage, compressed storage, and deletion. Compressed storage can use different compression algorithms: low-density compression, which gives priority to query efficiency, with a data compression ratio of about 2:1; medium-density compression, which balances query efficiency and storage space, with a data compression ratio of about 5:1; and high-density compression, which gives priority to compression ratio over query efficiency, with a data compression ratio above 8:1.

In addition, different storage policies can be adopted for different storage time ranges. For example, different storage policies can be set for when the data is first stored in the database, for the Xth day of storage, and for the Yth month of storage.

The management of the database using the data lifecycle rules of the business type can automatically compress and clear the data, reduce the management difficulty, and improve the database usage rate.

In addition, production data and backup data can have different data life cycles. For example, production data can be stored uncompressed when it enters the database, compressed at low density after 30 days, and deleted after 90 days, whereas backup data can be compressed at medium density when it enters the database, compressed at high density after 90 days, and never deleted.
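
The example life cycles above can be expressed as a small lookup table keyed by data age. This is a sketch under stated assumptions: the table `LIFE_CYCLES`, the function `storage_policy`, and the choice to treat "after N days" as "age of at least N days" are all introduced here for illustration.

```python
# Hypothetical encoding of the example life cycles: (minimum age in days, policy),
# ordered from the oldest threshold to the newest.
LIFE_CYCLES = {
    "production": [(90, "delete"),
                   (30, "low-density compression"),
                   (0, "no compression")],
    "backup": [(90, "high-density compression"),
               (0, "medium-density compression")],
}


def storage_policy(purpose: str, age_days: int) -> str:
    """Return the storage policy that applies to data of a given age."""
    for min_age, policy in LIFE_CYCLES[purpose]:
        if age_days >= min_age:
            return policy
    raise ValueError("age_days must be non-negative")
```

A periodic management job could call `storage_policy` for each batch of data and compress or delete it accordingly, which is the automation the life cycle rules are meant to enable.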

2) Define the sub-service type; for example, the detailed-list service can be divided into GSM (Global System for Mobile Communications) voice detailed lists, SMS detailed lists, and so on. A sub-service type is equivalent to a table of the cloud storage.

By default, information such as the number of copies of a sub-business type inherits the settings of the business type. In addition, the sub-service type can also set its own number of saved copies, and the data life cycle of each copy.

3) Set the relationship between the cloud storage directory and the sub-service type. This relationship is resolved according to the longest-path-first principle. For example, suppose "/CDR/" corresponds to the default service. All files under the directory "/CDR/gsm_cdr/" (including files in subdirectories) belong to the sub-service "GSM voice" and are imported into the GSM voice detailed-list table, all files under "/CDR/gprs/" (including files in subdirectories) are imported into the GPRS detailed-list table, and the other files under the directory "/CDR/" are imported into the "default service" table. That is, when a sub-service type is looked up by directory and the directory has five levels, the search starts from the fifth-level directory; if no rule exists for the fifth-level directory, the search continues at the fourth-level directory, and so on.
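
The longest-path-first lookup can be sketched as a walk from the deepest directory upward. The rule table `DIRECTORY_RULES` and the function `sub_service_for` are hypothetical names; the directory names come from the example in the text.

```python
import posixpath

# Hypothetical rule table built from the example in the text.
DIRECTORY_RULES = {
    "/CDR": "default service",
    "/CDR/gsm_cdr": "GSM voice",
    "/CDR/gprs": "GPRS",
}


def sub_service_for(file_path: str) -> str:
    """Walk from the deepest directory upward and return the first rule that matches."""
    directory = posixpath.dirname(file_path)
    while True:
        if directory in DIRECTORY_RULES:
            return DIRECTORY_RULES[directory]
        if directory in ("/", ""):
            raise LookupError(f"no sub-service type configured for {file_path}")
        directory = posixpath.dirname(directory)
```

For a file such as `/CDR/gsm_cdr/531/20111201/a.txt`, the lookup tries the fifth-level and fourth-level directories first, finds no rule, and then matches `/CDR/gsm_cdr`.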

4) Define import rules for the files corresponding to these sub-service types. The main settings are: the name of the information; the position of the information; the type of the information, such as integer, decimal, string, or large text (for storing images, files), and so on; whether the information is of the data distribution type, in which case the data as a whole is distributed uniformly according to this information; whether the information is time-type data, in which case the life cycle of the data can be defined according to this field; and, when the information is time-type data, the time format, for example YYYY-MM-DD HH24:MI:SS.

For example, for GSM voice tickets you can set the following values:

The information name is: mobile phone number (such as 13606401754); information resolution position is 1; information type is string STRING, that is, according to string processing; information is data distribution type: Yes, for example, GSM detailed list is evenly distributed according to mobile phone number; Whether the information is time type data: not time type; no time format is set.

The format of the GSM voice bill is:

13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20|···

The phone number is the first field, so the position at which the cloud storage parses this information is 1.

For another example: the information name is: call start time; information resolution position is 4; information type is string STRING, that is, processed as a string; information is data distribution type: Yes, for example, GSM detailed list is evenly distributed according to mobile phone number; whether the information is time type data: it is time type; the time format is: YYYY-MM-DD HH24:MI:SS, such as 2011-12-31 09:30:00.

The information name is: CDR type; the information resolution position is 2; the information type is string STRING, that is, processed as a string; whether the information is data distribution type: No; whether the information is time type data: not time type; no time format is set.

The information name is: the other party number (such as: 053188163000); the information resolution position is 3; the information type is string STRING, that is, according to the string processing; whether the information is data distribution type: No; whether the information is time type data: not time Type; no time format is set.

The information name is: the duration of the call (such as 51 seconds); the information resolution position is 5; the information type is string STRING, that is, processed as a string; whether the information is data distribution type: No; whether the information is time type data: not time type; no time format is set.

The information name is: call cost (such as 0.20 yuan); the information resolution position is 6; the information type is decimal; whether the information is data distribution type: No; whether the information is time type data: not time type; no time format is set.
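
The import rules for the GSM voice example can be sketched as a small table that drives parsing of one record. This is an illustration only: the field identifiers, the tuple layout of `GSM_VOICE_RULES`, and the function `parse_record` are assumptions, and the Python format string `%Y-%m-%d %H:%M:%S` is used as the equivalent of YYYY-MM-DD HH24:MI:SS.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical rule table: (field name, 1-based position, type, time format or None).
GSM_VOICE_RULES = [
    ("mobile_number", 1, "string", None),
    ("cdr_type", 2, "string", None),
    ("other_party_number", 3, "string", None),
    ("call_start_time", 4, "string", "%Y-%m-%d %H:%M:%S"),
    ("call_duration", 5, "string", None),
    ("call_cost", 6, "decimal", None),
]


def parse_record(line: str) -> dict:
    """Split a vertical-bar-delimited record and convert each field per its rule."""
    fields = line.split("|")
    record = {}
    for name, position, kind, time_format in GSM_VOICE_RULES:
        raw = fields[position - 1].strip()  # positions in the rules are 1-based
        if time_format:
            record[name] = datetime.strptime(raw, time_format)
        elif kind == "decimal":
            record[name] = Decimal(raw)
        else:
            record[name] = raw
    return record


record = parse_record("13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20")
```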

Step 502: The cloud storage device performs initial configuration on a data extraction rule, and initially configures a data upload rule.

The cloud storage device receives an initial configuration of the data extraction and data uploading rules from the administrator. Specifically, the data extraction rule includes the following: (1) the external data source from which data is extracted, and the connection mode to that data source; (2) the number of data extraction processes and the data range corresponding to each data extraction process, for example, extracting data by region, mobile number segment, number mantissa, and so on; (3) the size of the extracted files, for example, a file of 10 MB, and the threshold on the number of extracted numbers, for example, at most 100 phone numbers; (4) the file storage path after data extraction. The data uploading rule includes the following: the number of data uploading processes and the data range corresponding to each data uploading process.
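
One possible reading of the extraction rule is sketched below: numbers are assigned to extraction processes by their last digit (the number mantissa), and each process cuts its output into files holding at most 100 numbers. The function names and the modulo assignment are assumptions for illustration, not the configured rule itself.

```python
def extraction_process_for(mobile_number: str, process_count: int) -> int:
    """Pick an extraction process from the last digit (mantissa) of the number."""
    return int(mobile_number[-1]) % process_count


def split_into_files(numbers: list, max_numbers_per_file: int = 100) -> list:
    """Cut one process's output into files holding at most N numbers each."""
    return [numbers[i:i + max_numbers_per_file]
            for i in range(0, len(numbers), max_numbers_per_file)]


# Hypothetical batch of 250 consecutive numbers handled by one process.
batch = [str(13606400000 + i) for i in range(250)]
files = split_into_files(batch)
```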

It should be noted that steps 501 and 502 are performed in preparation for implementing the embodiment of the present invention. Their execution order is not strictly fixed: step 501 may be performed first, or step 502 may be performed first.

Step 503: The cloud storage device extracts data in the external data source according to the configured data extraction rule, obtains the first data, or performs format conversion on the data in the external data source to obtain the second data.

It should be noted that the external data source can be a data source saved in the terminal.

Alternatively, the external data source can convert the CDR format into a format that the cloud storage can recognize, so that the cloud storage device directly receives data in a recognizable format. In this case, no data is extracted from the external data source; the second data is obtained after the format conversion, and the received data is then uploaded and imported into the distributed database of the cloud storage.

Each data extraction process performs data extraction according to the configured data extraction rules. The extracted data is in text file format, for example, a bill file in the following format:

13606400001|01|053188163000|2011-12-31 09:30:00|51|0.20|···
13606400001|01|13906400128|2011-12-31 09:35:10|65|0.40|···
13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20|···
13606401754|01|13906400128|2011-12-31 09:35:10|65|0.40|···

where each line of the file represents one call detail record, with fields separated by vertical bars. The fields are defined as:

1. Mobile number, for example, 13606400001;

2. Call type, where 01 stands for the caller and 02 stands for the called party;

3. The other party number, for example, 053188163000;

4. Call time, for example, 2011-12-31 09:30:00;

5. The length of the call (in seconds), for example, 51 seconds;

6. Call charges (yuan), for example, 0.2.

For example, the first call detail record means that the owner of mobile number 13606400001 dialed 053188163000 at 2011-12-31 09:30:00; the call lasted 51 seconds, and the call charge was 0.2 yuan.
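A record in this layout can be parsed by splitting on the vertical bar. The sketch below assumes the six fields listed above appear in that order and ignores any trailing fields; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal


@dataclass(frozen=True)
class CallRecord:
    mobile_number: str
    call_type: str          # raw code such as "01", kept uninterpreted
    other_number: str
    call_time: datetime
    duration_seconds: int
    charge_yuan: Decimal


def parse_cdr_line(line: str) -> CallRecord:
    """Parse one vertical-bar-delimited call detail record."""
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    mobile, call_type, other, call_time, duration, charge = fields[:6]
    return CallRecord(
        mobile_number=mobile,
        call_type=call_type,
        other_number=other,
        call_time=datetime.strptime(call_time, "%Y-%m-%d %H:%M:%S"),
        duration_seconds=int(duration),
        charge_yuan=Decimal(charge),  # Decimal avoids float rounding on money
    )


record = parse_cdr_line("13606400001|01|053188163000|2011-12-31 09:30:00|51|0.20")
```

Phone numbers are kept as strings so that leading zeros, as in 053188163000, are preserved.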

Step 504: The cloud storage device acquires the transit area path of the cloud storage from the management node of the cloud storage.

For example, the data uploading process connects to the management node of the cloud storage and invokes the data uploading directory service provided by the cloud storage. The parameters of the data uploading directory service include a service type, a sub-service type, and data features (for example, 531, the area code of the Jinan area, and 20111201). The management node of the cloud storage determines the file directory that the data uploading process can use according to the service settings and the busyness of each cloud storage node, organizes the file directory into URL (Uniform Resource Locator) format, and returns it to the data uploading process. The directory in URL format is the transit area path to be obtained. For example, it can be ftp://192.168.1.1/CDR/gsm_cdr/531/20111201/.
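How a management node might assemble such a path can be sketched as follows. This is an illustration under assumptions, not the directory service of the embodiment: the function names, the load figures, and the rule "pick the least busy node" are hypothetical, and the host addresses are example values.

```python
from typing import Dict
from urllib.parse import quote


def pick_least_busy(node_loads: Dict[str, float]) -> str:
    """Return the host with the lowest load (an assumed busyness measure)."""
    return min(node_loads, key=node_loads.get)


def build_transit_url(host: str, service_type: str, sub_service_type: str,
                      area_code: str, date: str) -> str:
    """Organize the upload directory into an FTP URL."""
    parts = [service_type, sub_service_type, area_code, date]
    path = "/".join(quote(p, safe="") for p in parts)
    return f"ftp://{host}/{path}/"


host = pick_least_busy({"192.168.1.1": 0.2, "192.168.1.2": 0.7})
url = build_transit_url(host, "CDR", "gsm_cdr", "531", "20111201")
```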

Step 505: The cloud storage device saves the first data or the second data to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data uploading rules and the transit area path, where each file in each directory of the temporary file transfer area is saved in text file format.

The data is saved to the corresponding directory of the temporary file transfer area of the cloud storage according to its specific type. For example, GSM format data is saved in files under "/CDR/gsm_cdr/", and GPRS format data is saved in files under "/CDR/gprs/".

It should be noted that when the amount of data to be stored in the distributed database of the cloud storage is very large, or when network conditions are poor, the import process may be interrupted. The temporary file transfer area established in the embodiment of the present invention therefore ensures that the data is completely uploaded to the second directory of the cloud storage before it is imported into the distributed database of the cloud storage, which improves the integrity of the transmitted data.

Step 506: The cloud storage device determines the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the configured directory and sub-service type rules in the cloud storage rules.

The data import process of the cloud storage saves the temporary files in the transit area to the distributed database of the cloud storage. First, the data import process determines the sub-service type according to the transit area directory in which the data to be imported is located. Specifically, the table name corresponding to the files in each directory is obtained from the predefined "directory and sub-service type" rules in the cloud storage rules initially configured in step 501; multiple data import processes then scan multiple directories in parallel, thereby determining the sub-service type of the files under the second directory.

For example, files to be imported in the "/CDR/gsm_cdr/" directory (files whose transfer has been confirmed complete and which can therefore be imported) are imported into the GSM voice bill table according to the predefined rules. When the sub-service type is determined from the "/CDR/gsm_cdr/" directory, it is determined to be the GSM voice list.
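The "directory and sub-service type" rule amounts to a lookup from a directory prefix to a sub-service type and a target table. A minimal sketch, assuming the rule is keyed by directory; the table names and the GPRS sub-service label are hypothetical:

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple

# Directory -> (sub-service type, target table). Table names are assumptions.
DIRECTORY_RULES = {
    "/CDR/gsm_cdr": ("GSM voice list", "gsm_voice_cdr"),
    "/CDR/gprs": ("GPRS list", "gprs_cdr"),
}


def resolve_sub_service(file_path: str) -> Optional[Tuple[str, str]]:
    """Walk up from the file until a configured directory is found."""
    path = PurePosixPath(file_path)
    for parent in path.parents:
        rule = DIRECTORY_RULES.get(str(parent))
        if rule is not None:
            return rule
    return None


rule = resolve_sub_service(
    "/CDR/gsm_cdr/531/20111201/GSM_531_20111201_1360640.0020"
)
```

Walking up the parent directories lets one rule cover every area-code and date subdirectory beneath it.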

Step 507: The cloud storage device distributes the data to be saved to the data nodes of each cloud storage according to the configured import rule, in the cloud storage rules, of the file corresponding to the sub-service type, and stores the data to be saved in the distributed database of the cloud storage in parallel.

The cloud storage device evenly distributes the data to be saved to each cloud storage data node according to a hash algorithm. The cloud storage device then either simultaneously stores the different data to be saved on the respective cloud storage data nodes into the distributed database of the cloud storage, or simultaneously stores the fragments into which the same data to be saved has been split across the cloud storage data nodes into the distributed database of the cloud storage.

According to the import rule of the file corresponding to the sub-service type defined in step 501, the files in the temporary file transfer area are imported into the distributed database of the cloud storage. During import, when the data distribution rule defined in the import rule specifies a distributed data type, the data to be saved in the temporary file transfer area is automatically distributed to each cloud storage node according to the hash algorithm. For example, records can be evenly distributed by mobile phone number: the records of mobile phone number 1 in the data to be saved in the temporary file transfer area are distributed to node A, and the records of mobile phone number 2 are distributed to node B.
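Hash-based distribution by mobile number can be sketched as below. The text does not name a particular hash function; SHA-256 with a modulo over the node list is an assumption made for illustration, and the node names are hypothetical. Such a scheme is only approximately even for a small number of keys.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def node_for_key(key: str, nodes: List[str]) -> str:
    """Map a key to one node; the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(records: List[str], nodes: List[str]) -> Dict[str, List[str]]:
    """Group records by the node chosen for their mobile number (field 1)."""
    buckets = defaultdict(list)
    for record in records:
        mobile_number = record.split("|", 1)[0]
        buckets[node_for_key(mobile_number, nodes)].append(record)
    return dict(buckets)


nodes = ["node_a", "node_b", "node_c"]
records = [
    "13606400001|01|053188163000|2011-12-31 09:30:00|51|0.20",
    "13606400001|01|13906400128|2011-12-31 09:35:10|65|0.40",
    "13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20",
    "13606401754|01|13906400128|2011-12-31 09:35:10|65|0.40",
]
buckets = distribute(records, nodes)
```

Because all records of one mobile number land on one node, a later query for that number needs results from only that node, although the query may still be sent to every node.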

Further, in this step the cloud storage imports data in parallel automatically, and the degree of parallelism increases automatically as cloud storage nodes are added. For example, with three cloud storage nodes the degree of parallelism is 3, and with four cloud storage nodes it is 4. It should be noted that traditional import methods do not import into the database in parallel by default. Although parallel import can be specified manually, the import and read parallelism does not grow as hardware capacity is added, so the method here offers better performance than traditional data import.
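The idea that the degree of parallelism follows the node count can be illustrated with a worker pool sized to the number of nodes. This is a sketch only: `import_to_node` is a hypothetical stand-in for the real write to a node's database, and threads are used purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def import_to_node(node: str, records: List[str]) -> Tuple[str, int]:
    """Placeholder for writing records to one node; returns the row count."""
    return node, len(records)


def parallel_import(buckets: Dict[str, List[str]]) -> Dict[str, int]:
    """Import every node's records concurrently, one worker per node."""
    if not buckets:
        return {}
    parallelism = len(buckets)  # 3 nodes -> 3 workers, 4 nodes -> 4 workers
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(import_to_node, node, rows)
                   for node, rows in buckets.items()]
        return dict(f.result() for f in futures)


imported = parallel_import({"node_a": ["r1", "r2"], "node_b": ["r3"], "node_c": []})
```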

In addition, when the data to be saved is imported into the distributed database of the cloud storage, multiple copies of the data can be saved according to the configured cloud storage rules. For example, three copies of the data can be saved for the detail list service: the first and second copies are production data, which is used for queries and is not compressed; the third copy is backup data, which undergoes medium-density compression.

Step 508: The cloud storage device performs different processing, in different periods, on the data to be saved in the distributed database of the cloud storage according to the configured data life cycle rules in the cloud storage rules.

For example, production data can undergo low-density compression after 30 days and be deleted after 90 days. Backup data undergoes medium-density compression when it is stored in the distributed database of the cloud storage and high-density compression after 90 days, and is never deleted.

Automatically compressing and clearing the database according to the data life cycle rules can improve the storage utilization of the database, reduce the workload of maintenance personnel, and reduce the management difficulty.
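The example life cycle rules can be written as a small decision function. The sketch below encodes only the example thresholds given in the text; treating "after N days" as "age of at least N days" is an assumption, and the action names are hypothetical.

```python
def lifecycle_action(kind: str, age_days: int) -> str:
    """Return the state a piece of data should be in at the given age."""
    if kind == "production":
        if age_days >= 90:
            return "delete"
        if age_days >= 30:
            return "low_compression"
        return "uncompressed"
    if kind == "backup":
        # Backup data is compressed from the start and never deleted.
        return "high_compression" if age_days >= 90 else "medium_compression"
    raise ValueError(f"unknown data kind: {kind}")


states = [lifecycle_action("production", d) for d in (10, 45, 120)]
```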

Step 509: The cloud storage device acquires an index field input by a user, and generates a query instruction according to the index field.

For example, the index fields can be the mobile number and the query month. The generated query instruction then includes the mobile number and the query month.

Step 510: The cloud storage device sends the query instruction to each cloud storage data node, and queries the data in a distributed database of the cloud storage in parallel;

It should be noted that only the production data is queried. The query application itself does not need to know where the data is stored or whether it is compressed. The steps of a data query that can be parallelized are decomposed and executed in parallel on each storage node, which greatly improves query performance.

Step 511: The cloud storage device sends the set of query results of the cloud storage nodes to the user.
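Steps 509 to 511 follow a scatter and gather pattern: the same query goes to every node, each node filters its own rows, and the partial results are combined. The following sketch uses in-memory lists in place of the per-node databases; the node names, the function names, and merging in node order are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def query_node(rows: List[str], mobile: str, month: str) -> List[str]:
    """Filter one node's rows by mobile number and month (YYYY-MM)."""
    matches = []
    for row in rows:
        fields = row.split("|")
        if fields[0] == mobile and fields[3].startswith(month):
            matches.append(row)
    return matches


def parallel_query(nodes: Dict[str, List[str]], mobile: str, month: str) -> List[str]:
    """Send the query to all nodes at once and merge results in node order."""
    if not nodes:
        return []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(query_node, rows, mobile, month)
                   for rows in nodes.values()]
        merged: List[str] = []
        for future in futures:
            merged.extend(future.result())
    return merged


NODES = {
    "node_a": [
        "13606400001|01|053188163000|2011-12-31 09:30:00|51|0.20",
        "13606400001|01|13906400128|2011-12-31 09:35:10|65|0.40",
    ],
    "node_b": [
        "13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20",
    ],
}
results = parallel_query(NODES, "13606400001", "2011-12")
```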

As shown in FIG. 5B, the cloud storage device acquires the data to be saved from the external data source through the data extraction process. The data upload process takes the resulting temporary files and uploads them to the temporary file transfer area, where they wait to be imported into the distributed database of the cloud storage. The data import process takes the files in the temporary file transfer area and imports them into the distributed database of the cloud storage in parallel. During later management of the data in the distributed database, the data is compressed and cleared according to the data life cycle rules. When the user needs to query data in the distributed database of the cloud storage, the user can query directly through the query interface.

The data storage and data query method provided by the embodiment of the invention improves the speed of storing and retrieving data and reduces the management difficulty by providing parallel storage and parallel data query.

An embodiment of the present invention provides a data storage device, where the device may be a cloud storage device. As shown in FIG. 6, the device includes: an obtaining module 601, a data acquiring unit 6011, a data uploading unit 6012, a storage module 602, a distribution unit 6021, a storage unit 6022, an initial configuration module 603, a determining module 604, and a management module 605.

The obtaining module 601 is configured to obtain data to be saved;

The storage module 602 is connected to the obtaining module 601, and is configured to evenly distribute the data to be saved to the data nodes of each cloud storage, and store the data to be saved in parallel in the distributed database of the cloud storage.

Further, the device further includes an initial configuration module 603, configured to: initially configure the cloud storage rules, including defining the directory and sub-service type rules, defining the import rules of the files corresponding to the sub-service types, and defining the life cycle of the data, where the life cycle refers to the storage strategy defined for each type of data according to time; initially configure the data extraction rules, including the data source from which data is extracted, the number of extraction processes, and the data range corresponding to each extraction process; and initially configure the data uploading rules, including the number of upload processes and the data range corresponding to each upload process.

Further, the obtaining module 601 includes: a data acquiring unit 6011, a data uploading unit 6012;

Optionally, the data acquiring unit 6011 is configured to extract the data in the external data source according to the configured data extraction rule to obtain the first data, or to perform format conversion on the data in the external data source to obtain the second data.

The data uploading unit 6012 is connected to the data acquiring unit 6011, and is configured to acquire the transit area path of the cloud storage from the management node of the cloud storage, and to save the first data or the second data in the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data uploading rule and the transit area path, where each file in each directory of the temporary file transfer area is saved in text file format.

Further, the device further includes a determining module 604. The determining module 604 is connected to the storage module 602 and is configured to determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the configured directory and sub-service type rules in the cloud storage rules. The storage module 602 is configured to distribute the data to be saved to the data nodes of each cloud storage according to the configured import rule, in the cloud storage rules, of the file corresponding to the sub-service type, and to store the data to be saved in the distributed database of the cloud storage in parallel.

Further, the distribution unit 6021 in the storage module 602 is configured to uniformly distribute the to-be-stored data to each cloud storage data node according to a hash algorithm;

The storage unit 6022 in the storage module 602 is configured to simultaneously store the different data to be saved on the cloud storage data nodes into the distributed database of the cloud storage, or to simultaneously store the fragments into which the same data to be saved has been split on the cloud storage data nodes into the distributed database of the cloud storage.

When the data to be saved is distributed to each cloud storage data node and stored in the distributed database of the cloud storage in parallel, the parallelism of the parallel storage is automatically increased as the cloud storage data nodes increase.

The storage module 602 is further configured to save, according to the configured cloud storage rules, different numbers of copies of the data of different uses stored in the distributed database, where the data of different uses includes production data and backup data, and the production data is used for queries.

The device further includes a management module 605, configured to perform different processing, in different periods, on the data to be saved in the distributed database of the cloud storage according to the configured data life cycle rules in the cloud storage rules. For example, for the detail list service, three copies of the data can be saved: the first and second copies are production data, which is used for queries and is not compressed, and the third copy is backup data, which undergoes medium-density compression.

An embodiment of the present invention provides a data storage device in which the obtaining module obtains the data to be saved, and the storage module evenly distributes the data to be saved to the data nodes of each cloud storage and stores the data to be saved in the distributed database of the cloud storage in parallel, which can increase the speed of data storage.

An embodiment of the present invention provides a device for querying data. The device may be a cloud storage device. As shown in FIG. 7, the device includes: an obtaining module 701, a processing module 702, a sending unit 7021, and a processing unit 7022.

The obtaining module 701 is configured to obtain an index field input by the user, and generate a query instruction according to the index field;

The processing module 702 is configured to send the query instruction to the data nodes of each cloud storage, query data in the distributed database of the cloud storage in parallel, and send the set of query results of the cloud storage nodes to the user.

The sending unit 7021 in the processing module 702 is configured to send the query instruction to the data nodes of each cloud storage at the same time. The processing unit 7022 in the processing module 702 is configured to simultaneously query, in the distributed database carried on the data nodes of each cloud storage, the data that meets the query instruction.

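Optionally, the processing module 702 is configured to: sort the query results of each cloud storage node according to a user-defined rule, and send the sorted query result set to the user; or sort the query results of each cloud storage node in node order, and send the sorted query result set to the user; or sort the query results of each cloud storage node according to the keywords in the query results, and send the sorted query result set to the user.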

The embodiment of the invention provides a device for querying data, in which the processing module queries the data in the database in parallel according to the query instruction generated by the obtaining module, so that query performance can be greatly improved.

It should be noted that the devices shown in FIG. 6 and FIG. 7 may be the same device, namely the cloud storage device; that is, the cloud storage device can perform both the data storage and the data query functions.

The embodiment of the present invention provides a data storage system, as shown in FIG. 8, including a terminal 801 and a cloud storage device 802;

The terminal 801 is configured to extract data from the data source according to the configured data extraction rule to obtain the first data, and to save the first data in a temporary folder, so that the cloud storage device uploads the first data in the temporary folder to the corresponding directory of the temporary file transfer area of the cloud storage device according to the data uploading rule and the obtained transit area path.

The cloud storage device 802 is configured to upload the first data in the temporary folder of the terminal to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data uploading rule and the transit area path, to evenly distribute the first data in the corresponding directory of the temporary file transfer area to each cloud storage data node, and to store the data to be saved in the distributed database of the cloud storage in parallel.

The data source is data stored in the terminal 801. Each data extraction process performs data extraction according to the configured data extraction rules and saves the extracted data to the directory corresponding to that data extraction process. Each directory includes multiple temporary files; for example, the extracted first data is saved as a first file, and the first file is saved to the first directory corresponding to the data extraction process. For example, for the process responsible for extracting the GSM call detail records of the Jinan 1360640 number segment for billing period 1 of December 2011, the corresponding directory is /531/gsm_cdr/201112/01/1360640.

A new file is generated when the size of the temporary file or the number of saved numbers reaches the threshold in the configured data extraction rules. For example, GSM_531_20111201_1360640.0020 is the CDR file of the 1360640 segment of Jinan City for December 2011 with serial number 0020. When the number of numbers saved in this CDR file reaches the threshold in the configured data extraction rules, the file GSM_531_20111201_1360640.0021 is generated.
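The rollover to a new serial number can be sketched as follows. Only the threshold on the number of saved numbers is modeled here; the file size threshold is omitted for brevity. The function names are hypothetical, and the file name pattern is inferred from the single example above.

```python
from typing import Dict, List


def cdr_file_name(area_code: str, date: str, segment: str, serial: int) -> str:
    """Build a name such as GSM_531_20111201_1360640.0020."""
    return f"GSM_{area_code}_{date}_{segment}.{serial:04d}"


def split_into_files(numbers: List[str], threshold: int, area_code: str,
                     date: str, segment: str,
                     first_serial: int = 1) -> Dict[str, List[str]]:
    """Start a new file each time the count of numbers reaches the threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    files: Dict[str, List[str]] = {}
    for offset in range(0, len(numbers), threshold):
        serial = first_serial + offset // threshold
        name = cdr_file_name(area_code, date, segment, serial)
        files[name] = numbers[offset:offset + threshold]
    return files


numbers = [f"1360640{i:04d}" for i in range(250)]
files = split_into_files(numbers, 100, "531", "20111201", "1360640", first_serial=20)
```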

The cloud storage device 802 can be the device for data storage as described in FIG.

When the amount of data to be stored in the distributed database of the cloud storage is very large, or when network conditions are poor, the import process may be interrupted. In the embodiment of the present invention, the extracted first data can therefore be saved in the temporary folder of the terminal and then uploaded to the established temporary file transfer area, ensuring that the data is completely uploaded to the second directory of the cloud storage before it is imported into the distributed database of the cloud storage, which improves the integrity of the transmitted data.

The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.

Claims


1. A data storage method, characterized by including:

The cloud storage device obtains the data to be saved;

The cloud storage device evenly distributes the data to be saved to each cloud storage data node, and stores the data to be saved in a distributed database of cloud storage in parallel.

2. The method according to claim 1, characterized in that the cloud storage device evenly distributing the data to be saved to each cloud storage data node and storing the data to be saved in a distributed database of cloud storage in parallel includes:

The cloud storage device evenly distributes the data to be saved to each cloud storage data node according to a hash algorithm;

The cloud storage device simultaneously stores the different data to be saved on each cloud storage data node into the distributed database of cloud storage, or simultaneously stores the fragments into which the same data to be saved has been split on each cloud storage data node into the distributed database of cloud storage.

3. The method according to claim 2, characterized in that when the cloud storage device distributes the data to be saved to each cloud storage data node and stores the data to be saved in the distributed database of cloud storage in parallel, the parallelism of the parallel storage is automatically increased as the cloud storage data nodes increase.

4. The method according to claim 1, characterized in that, before the cloud storage device obtains the data to be saved, the method further includes:

Initial configuration of cloud storage rules, including defining rules for directories and sub-service types, defining import rules for files corresponding to sub-service types, and defining the life cycle of data, where the life cycle refers to the storage strategy defined for each type of data according to time;

Initial configuration of data extraction rules, including the data source of extracted data, the number of extraction processes, and the data range corresponding to each extraction process;

Initial configuration of data upload rules, including the number of upload processes and the data range corresponding to each upload process.

5. The method according to claim 4, characterized in that obtaining the data to be saved includes:

Extract the data in the external data source according to the configured data extraction rules to obtain the first data, or convert the format of the data in the external data source to obtain the second data;

Obtain the transit area path of cloud storage according to the management node of cloud storage;

The first data or the second data is saved to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rules and the transfer area path, wherein each file in each directory of the temporary file transfer area is saved in text file format.

6. The method according to claim 5, characterized in that, before the cloud storage device evenly distributes the data to be saved to each cloud storage data node and stores the data to be saved in the distributed database of cloud storage in parallel, the method further includes:

Determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the configured rules of the directory and sub-service type in the cloud storage rules;

According to the configured import rules of files corresponding to sub-service types in the cloud storage rules, the data to be saved is evenly distributed to the data nodes of each cloud storage, and the data to be saved is stored in the distributed database of the cloud storage in parallel.

7. The method according to claim 6, characterized in that, after uniformly distributing the data to be saved to each cloud storage data node and storing the data to be saved in a cloud storage database in parallel, the method further includes:

According to the configured data life cycle rules in the cloud storage rules, different processing is performed on the data to be saved in the distributed database of the cloud storage in different periods.

8. The method according to claim 4, characterized in that, after the parallel storage of the data to be saved into the distributed database of cloud storage, it further includes:

According to the configured cloud storage rules, data of different uses stored in the distributed database is saved in different numbers of copies, wherein the data of different uses includes production data and backup data, and the production data is used for queries.

9. A data query method, characterized by including:

The cloud storage device obtains the index field input by the user and generates a query instruction according to the index field;

The cloud storage device sends the query instruction to each cloud storage data node, and queries data in the distributed database of cloud storage in parallel;

The cloud storage device sends a set of query results of each cloud storage node to the user.

10. The method according to claim 9, characterized in that the cloud storage device sends the query instruction to the data nodes of each cloud storage, and queries the data in the cloud storage database in parallel, including:

The cloud storage device sends the query instruction to the data nodes of each cloud storage at the same time;

The cloud storage device simultaneously queries the distributed database carried on the data nodes of each cloud storage for data that meets the query instruction.

11. The method according to claim 6, characterized in that the cloud storage device sends a set of query results of each cloud storage node to the user, including:

The cloud storage device sorts the query results of each cloud storage node according to user-defined rules, and sends the sorted query result set to the user; or,

The cloud storage device sorts the query results of each cloud storage node in node order, and sends the sorted query result set to the user; or,

The cloud storage device sorts the query results of each cloud storage node according to the keywords in the query results, and sends the sorted query result set to the user.

12. A data storage device, characterized by including:

An acquisition module, used to obtain data to be saved;

The storage module is used to evenly distribute the data to be saved to the data nodes of each cloud storage, and store the data to be saved in the distributed database of the cloud storage in parallel.

13. The device according to claim 12, characterized in that the storage module includes: a distribution unit for evenly distributing the data to be saved to each cloud storage data node according to a hash algorithm;

A storage unit configured to simultaneously store the different data to be saved on each cloud storage data node into the distributed database of cloud storage, or to simultaneously store the fragments into which the same data to be saved has been split on each cloud storage data node into the distributed database of cloud storage.

14. The device according to claim 13, characterized in that when the data to be saved is evenly distributed to each cloud storage data node and stored in the distributed database of cloud storage in parallel, the degree of parallelism of the parallel storage is automatically increased as the number of cloud storage data nodes increases.

15. The device according to claim 12, characterized in that the device further includes: an initial configuration module for initial configuration of cloud storage rules, including defining rules for directories and sub-service types, defining import rules for the files corresponding to sub-service types, and defining the life cycle of the data, where the life cycle refers to the storage strategy defined for each type of data according to time;

And perform initial configuration of data extraction rules, including the data source of extracted data, the number of extraction processes, and the data range corresponding to each extraction process;

And perform initial configuration of data upload rules, including the number of upload processes and the data range corresponding to each upload process.

16. The device according to claim 15, characterized in that, the acquisition module includes: a data acquisition unit, configured to extract data from an external data source according to the configured data extraction rules to obtain the first data, Or obtain the second data after format conversion of the data in the external data source;

A data upload unit, configured to obtain the transit area path of the cloud storage according to the management node of the cloud storage; and save the first data or the second data to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rules and the transit area path, wherein each file in each directory of the temporary file transfer area is saved in text file format.

17. The device according to claim 16, characterized in that the device further includes: a determination module, configured to determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the configured rules of the directory and sub-service type in the cloud storage rules; the storage module is configured to distribute the data to be saved to the data nodes of each cloud storage according to the configured import rules of the files corresponding to the sub-service type in the cloud storage rules, and store the data to be saved in the distributed database of the cloud storage in parallel.

18. The device according to claim 17, characterized in that the device further includes: a management module, configured to perform, according to the configured data life cycle rules in the cloud storage rules, different processing on the data to be saved in the distributed database of the cloud storage in different periods.

19. The device according to claim 15, wherein the storage module is further configured to: save data of different uses stored in the distributed database in different numbers of copies according to the configured cloud storage rules, wherein the data of different uses includes production data and backup data, and the production data is used for query.

20. A data query device, characterized by including:

The acquisition module is used to obtain the index fields input by the user and generate query instructions based on the index fields;

A processing module, configured to send the query instructions to the data nodes of each cloud storage, query data in the distributed database of the cloud storage in parallel; and send a set of query results of each cloud storage node to the user.

21. The device according to claim 20, characterized in that the processing module includes: a sending unit for sending the query instructions to the data nodes of each cloud storage at the same time; and a processing unit for simultaneously querying, in the distributed database carried on the data nodes of each cloud storage, data that conforms to the query instructions.

22. The device according to claim 20, characterized in that the processing module is configured to: sort the query results of each cloud storage node according to user-defined rules, and send the sorted query result set to said user; or,

Sort the query results of each cloud storage node according to the node order, and send the sorted query result set to the user; or,

The query results of each cloud storage node are ordered according to the keywords in the query results, and the sorted query result set is sent to the user.

23. A data storage system, characterized in that it includes: a terminal and a cloud storage device;

The terminal is used to extract data from the data source according to configured data extraction rules to obtain the first data, and to save the first data in a temporary folder, so that the cloud storage device uploads the first data in the temporary folder to the corresponding directory of the temporary file transfer area of the cloud storage device according to the data upload rules and the obtained transfer area path;

The cloud storage device is used to upload the first data in the temporary folder of the terminal to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rules and the transit area path; evenly distribute the first data in the corresponding directory of the temporary file transfer area to each cloud storage data node; and store the data to be saved in the distributed database of cloud storage in parallel.

24. The system according to claim 23, characterized in that the cloud storage device is the data storage device according to any one of claims 12 to 19.