WO2014015488A1 - Method and apparatus for data storage and query - Google Patents

Method and apparatus for data storage and query

Info

Publication number
WO2014015488A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
cloud storage
saved
query
rules
Prior art date
Application number
PCT/CN2012/079155
Other languages
French (fr)
Chinese (zh)
Inventor
韩建中
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201280000916.6A priority Critical patent/CN102906751B/en
Priority to PCT/CN2012/079155 priority patent/WO2014015488A1/en
Publication of WO2014015488A1 publication Critical patent/WO2014015488A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Definitions

  • The present invention relates to the field of communication network technologies, and in particular, to a data storage and data query method and apparatus.
  • Background
  • Cloud computing is the product of the development of distributed processing, parallel processing and grid computing.
  • Cloud storage is an extension and development of cloud computing. It refers to a system in which a large number of storage devices in the network work together through clustering applications, grid technologies, distributed file systems, distributed databases, and the like, to jointly provide data storage and business access functions externally.
  • Relational databases store data in rows and columns. Take the CDR (call detail record) list in an Oracle database as an example. Generally, each CDR record exists as a row in a database table. Each row contains: the number, the number of the other party, the duration of the call, and the like.
  • The data is stored in the form of data blocks (Oracle data blocks).
  • The data block is the smallest storage unit of Oracle and occupies a certain amount of disk space (such as a 16 KB block); that is, every Oracle I/O (input/output) operation is performed in units of blocks. For example, although a CDR record is only 100 bytes long, at least one data block is read when it is queried. If the record spans two data blocks, two blocks must be read.
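  • The block-granularity cost described above can be checked with a little arithmetic. The following is a minimal sketch, not Oracle's actual storage logic: the function name `blocks_read` and the byte offsets are hypothetical, and the 16 KB block size is taken from the example in the text.

```python
def blocks_read(offset: int, length: int, block_size: int = 16 * 1024) -> int:
    """Number of fixed-size blocks touched by a record starting at byte `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return last_block - first_block + 1


# A 100-byte record that lies wholly inside one 16 KB block costs one block read.
print(blocks_read(offset=0, length=100))               # 1
# The same record straddling a block boundary costs two block reads.
print(blocks_read(offset=16 * 1024 - 50, length=100))  # 2
```

  • In both cases the whole block is transferred, so reading 100 bytes of payload moves 16 KB or 32 KB of data.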
  • A file system can also be used for data storage and query.
  • The detailed-list and billing data are stored as files in the file system.
  • The file system can classify data by region, time (such as account period), number, and so on, and directly store structured records in files as text or in other ways.
  • The file system adopts a storage layout with time as the directory structure; for example, a directory is created according to time (account period) and user number segment, and one record file is created per number.
  • When data needs to be queried, a simple index can be built from the directory hierarchy, file names, and so on.
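  • A directory-based index of this kind can be sketched as a pure path computation. The root directory, the 7-digit number segment, and the `.txt` extension below are illustrative assumptions, not values given in the text.

```python
from pathlib import PurePosixPath


def record_file_path(root: str, account_period: str, number: str,
                     segment_len: int = 7) -> PurePosixPath:
    """Locate a number's record file from the directory hierarchy alone.

    Layout (assumed): <root>/<account period>/<number segment>/<number>.txt
    """
    segment = number[:segment_len]  # leading digits act as the number segment
    return PurePosixPath(root) / account_period / segment / f"{number}.txt"


print(record_file_path("/cdr", "201112", "13606401754"))
# /cdr/201112/1360640/13606401754.txt
```

  • No separate index structure is consulted: the account period and the number fully determine the file to open.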
  • Embodiments of the present invention provide a data storage and data query method and apparatus, which can improve the speed of storing and retrieving data.
  • a method of data storage including:
  • the cloud storage device obtains data to be saved;
  • the cloud storage device distributes the data to be saved to each cloud storage data node, and stores the data to be saved in the distributed database of the cloud storage in parallel.
  • a method of data query including:
  • the cloud storage device obtains an index field input by the user, and generates a query instruction according to the index field; the cloud storage device sends the query instruction to each cloud storage data node, and queries the data in the distributed database of the cloud storage in parallel;
  • the cloud storage device sends the set of query results of the cloud storage nodes to the user.
  • a device for data storage comprising:
  • An obtaining module configured to obtain data to be saved
  • a storage module configured to distribute the data to be saved to each cloud storage data node, and store the data to be saved in a distributed database of the cloud storage.
  • a device for data query comprising:
  • An obtaining module configured to obtain an index field input by a user, and generate a query instruction according to the index field
  • a processing module configured to send the query instruction to a data node of each cloud storage, query data in a distributed database of the cloud storage in parallel, and send the set of query results of the cloud storage nodes to the user.
  • a data storage system includes: a terminal and a cloud storage device;
  • the terminal is configured to extract data in the data source according to the configured data extraction rule to obtain the first data, and to save the first data in a temporary folder, so that the cloud storage device uploads the first data in the temporary folder to a corresponding directory in a temporary file transfer area of the cloud storage device according to the data uploading rule and the obtained transit zone path;
  • the cloud storage device is configured to upload the first data in the temporary folder of the terminal to the corresponding directory in the temporary file transfer area of the cloud storage according to the configured data uploading rule and the transit zone path, to evenly distribute the first data in the corresponding directory of the temporary file transfer area to each cloud storage data node, and to store the to-be-saved data in the distributed database of the cloud storage in parallel.
  • The embodiments of the present invention provide a method and a device for data storage and data query. The cloud storage device acquires data to be saved, uniformly distributes the data to be saved to each cloud storage data node, and stores it in the distributed database of the cloud storage in parallel. For queries, the cloud storage device obtains an index field input by the user and generates a query instruction according to the index field; it sends the query instruction to each cloud storage data node and queries the data in the distributed database of the cloud storage in parallel; and it sends the set of query results of the cloud storage nodes to the user.
  • FIG. 1 is a flowchart of a method for data storage according to Embodiment 1 of the present invention
  • FIG. 2 is a flowchart of a method for data query according to Embodiment 1 of the present invention
  • FIG. 3 is a block diagram of a device for data storage according to Embodiment 1 of the present invention
  • FIG. 4 is a block diagram of an apparatus for data query according to Embodiment 1 of the present invention.
  • FIG. 5A is a flowchart of a data storage and data query method according to Embodiment 2 of the present invention
  • FIG. 5B is a schematic diagram of a data storage and data query method according to Embodiment 2 of the present invention
  • Figure ⁇ is a block diagram of an apparatus for data query provided by Embodiment 2 of the present invention.
  • FIG. 8 is a schematic diagram of a system for data storage according to Embodiment 2 of the present invention.
  • Detailed description
  • An embodiment of the present invention provides a data storage method. As shown in FIG. 1 , the method includes: Step 101: A cloud storage device acquires data to be saved.
  • The method further includes: initial configuration of the cloud storage rule, including defining a rule of the directory and the sub-service type, defining an import rule of the file corresponding to the sub-service type, and defining a life cycle of the data, where the life cycle refers to the storage strategy defined for each type of data according to time; initial configuration of the data extraction rule, including the data source for extracting data, the number of extraction processes, and the data range corresponding to each extraction process; and initial configuration of the data uploading rule, including the number of uploading processes and the data range corresponding to each uploading process.
  • The first data or the second data is saved in the corresponding directory of the temporary file transfer area of the cloud storage.
  • Step 102: The cloud storage device distributes the data to be saved to each cloud storage data node, and stores the data to be saved in the distributed database of the cloud storage in parallel.
  • The cloud storage device uniformly distributes the to-be-saved data to each cloud storage data node according to a hash algorithm.
  • The cloud storage device simultaneously stores the different data to be saved on the cloud storage data nodes into the distributed database of the cloud storage, or simultaneously stores the fragments split from the same data to be saved on the cloud storage data nodes into the distributed database of the cloud storage.
  • The cloud storage device uniformly distributes the to-be-saved data to each cloud storage data node, and stores the to-be-saved data in the distributed database of the cloud storage in parallel; as cloud storage data nodes are added, the degree of parallelism of the parallel storage increases automatically.
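  • The distribute-then-store-in-parallel step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: CRC32 stands in for the unspecified hash algorithm, in-memory lists stand in for the data nodes, and the names `node_for`, `distribute` and `store_parallel` are hypothetical.

```python
import zlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def node_for(key: str, node_count: int) -> int:
    """Stable hash of the distribution key, reduced to a node index."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(records, node_count):
    """Group (key, payload) records by their target node."""
    buckets = defaultdict(list)
    for key, payload in records:
        buckets[node_for(key, node_count)].append((key, payload))
    return buckets


def store_parallel(records, nodes):
    """Write every node's bucket concurrently, one worker per data node."""
    buckets = distribute(records, len(nodes))

    def write(i):
        batch = buckets.get(i, [])
        nodes[i].extend(batch)  # stand-in for a node-local database write
        return len(batch)

    # The degree of parallelism follows the number of nodes.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(write, range(len(nodes))))


nodes = [[], [], []]  # three in-memory stand-ins for cloud storage data nodes
records = [(f"1360640{n:04d}", "cdr") for n in range(1000)]
written = store_parallel(records, nodes)
print(written, sum(written))
```

  • Because the worker count is derived from `len(nodes)`, adding a fourth node raises the degree of parallelism to 4 without any other change.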
  • the method further includes:
  • The data to be saved is evenly distributed to each cloud storage node according to the import rule of the file corresponding to the sub-service type in the cloud storage rule, and the data to be saved in the first file is stored in the distributed database of the cloud storage in parallel.
  • Data for different uses stored in the distributed database is saved in different numbers of copies according to the configured cloud storage rule, where the data for different uses includes production data and backup data, and the production data is used for querying.
  • The embodiment of the invention provides a data storage method. By distributing the data to be saved to each cloud storage data node and storing it in the distributed database of the cloud storage in parallel, the data records are stored in the database in a distributed manner, so that data can be saved quickly.
  • An embodiment of the present invention provides a data query method. As shown in FIG. 2, the method includes: Step 201: A cloud storage device acquires an index field input by a user, and generates a query instruction according to the index field.
  • the user can input the mobile phone number and the month of the detailed list to be queried, and can generate a query command according to the mobile phone number and the month to perform subsequent query operations.
  • the index field of the user input is received through the query interface.
  • Step 202 The cloud storage device sends the query instruction to each cloud storage data node, and queries the data in a distributed database of the cloud storage in parallel;
  • the saved data can be divided into production data and backup data.
  • When querying, only the production data is queried.
  • The backup data can be used to recover the production data.
  • When querying data, the user does not need to care where the data is stored or whether it is compressed.
  • The cloud storage device sends the query instruction to the data nodes of each cloud storage at the same time; the cloud storage device simultaneously queries the distributed database carried on the data nodes of the cloud storage for data that meets the query instruction.
  • Step 203 The cloud storage device sends the set of query results of the cloud storage nodes to the user.
  • the cloud storage device sorts the query results of the cloud storage nodes according to a user-defined rule, and sends the sorted query result set to the user; or
  • the cloud storage device sorts the query results of the cloud storage nodes in the order of the nodes, and sends the sorted query result sets to the user; or
  • the cloud storage device sorts the query results of each cloud storage node according to keywords in the query results, and sends the sorted query result set to the user.
  • the embodiment of the invention provides a data query method.
  • each cloud storage data node queries data in a distributed database of the cloud storage in parallel, so that the query performance can be greatly improved.
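  • The query flow (send one instruction to every node, query in parallel, merge and optionally sort) can be sketched as below. The row layout, the dictionary-shaped query instruction, and the function names are assumptions made for illustration; in-memory lists stand in for the data nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def query_node(node_rows, instruction):
    """Scan one node's rows for records that match the query instruction."""
    return [r for r in node_rows
            if r["number"] == instruction["number"]
            and r["start"].startswith(instruction["month"])]


def parallel_query(nodes, instruction, sort_key=None):
    """Send the instruction to every node at once and merge the results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        per_node = list(pool.map(lambda rows: query_node(rows, instruction), nodes))
    merged = [row for rows in per_node for row in rows]  # node order by default
    return sorted(merged, key=sort_key) if sort_key else merged


nodes = [
    [{"number": "13606400001", "start": "2011-12-31 09:30:00"},
     {"number": "13606400001", "start": "2011-12-02 10:00:00"},
     {"number": "13606400001", "start": "2011-11-20 18:45:00"}],
    [{"number": "13606400002", "start": "2011-12-01 08:00:00"}],
    [],
]
instruction = {"number": "13606400001", "month": "2011-12"}
result = parallel_query(nodes, instruction, sort_key=lambda r: r["start"])
print([r["start"] for r in result])
# ['2011-12-02 10:00:00', '2011-12-31 09:30:00']
```

  • Passing no `sort_key` returns results in node order; passing one corresponds to sorting by a user-defined rule or by a keyword in the results.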
  • the embodiment of the present invention provides a data storage device, which may be a cloud storage device, as shown in FIG. 3, the device includes: an obtaining module 301, a storage module 302;
  • the obtaining module 301 is configured to obtain data to be saved
  • the data obtaining unit in the obtaining module 301 is configured to extract data in an external data source according to the configured data extraction rule to obtain first data, or format data in the external data source. Obtaining second data after conversion;
  • a data uploading unit in the obtaining module 301, configured to acquire a transit area path of the cloud storage from the management node of the cloud storage, and to save the first data or the second data to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data uploading rule and the transit area path.
  • The device further includes an initial configuration module, configured to initially configure a cloud storage rule, including a rule for defining a directory and a sub-service type, an import rule for defining a file corresponding to the sub-service type, and a definition of the life cycle of the data, where the life cycle refers to the storage strategy defined for each type of data according to time; to initially configure the data extraction rule, including the data source for extracting data, the number of extraction processes, and the data range corresponding to each extraction process; and to initially configure the data uploading rule, including the number of upload processes and the data range corresponding to each upload process.
  • the storage module 302 is configured to evenly distribute the data to be saved to each cloud storage data node, and store the data to be saved in a distributed database of the cloud storage in parallel.
  • the distribution unit in the storage module 302 is configured to uniformly distribute the to-be-saved data to each cloud storage data node according to a hash algorithm
  • a storage unit in the storage module 302, configured to simultaneously store different data to be saved on the cloud storage data nodes into the distributed database of the cloud storage, or to simultaneously store the fragments split from the same data to be saved on the cloud storage data nodes into the distributed database of the cloud storage.
  • The device further includes: a determining module, configured to determine the sub-service type of the to-be-saved data according to the corresponding directory of the temporary file transfer area and the configured rule of the directory and the sub-service type in the cloud storage rule;
  • the storage module 302 is configured to distribute the data to be saved to the data nodes of each cloud storage according to the configured import rule of the file corresponding to the sub-service type in the cloud storage rule, and to store the data to be saved in the distributed database of the cloud storage.
  • The device further includes: a management module, configured to perform different processing on the data to be saved in the distributed database of the cloud storage in different periods, according to the configured data life cycle rule in the cloud storage rule.
  • the storage module is further configured to:
  • data for different uses stored in the distributed database is saved in different numbers of copies according to the configured cloud storage rule, where the data for different uses includes production data and backup data, and the production data is used for querying.
  • An embodiment of the present invention provides a device for storing data, in which the obtaining module obtains data to be saved, and the storage module uniformly distributes the data to be saved to the data nodes of each cloud storage and stores it in the distributed database of the cloud storage in parallel. The distributed storage of data records in the database makes it possible to save data quickly.
  • the embodiment of the present invention provides a device for querying data, and the device may be a cloud storage device, as shown in FIG. 4, the device includes: an obtaining module 401, a processing module 402;
  • the obtaining module 401 is configured to obtain an index field input by the user, and generate a query instruction according to the index field;
  • the processing module 402 is configured to send the query instruction to each cloud storage data node, query data in the distributed database of the cloud storage in parallel, and send the set of query results of the cloud storage nodes to the user.
  • the sending unit in the processing module 402 is configured to send the query instruction to the data nodes of each cloud storage at the same time;
  • the processing unit in the processing module 402 is configured to simultaneously query data that meets the query instruction in a distributed database carried on the data nodes of each cloud storage.
  • The processing module 402 is configured to:
  • sort the query results of the cloud storage nodes according to keywords in the query results, and send the sorted query result set to the user.
  • The embodiment of the invention provides a device for querying data.
  • Based on the query instruction generated by the obtaining module, the processing module queries the data in the database in parallel, so that query performance can be greatly improved.
  • the embodiment of the invention provides a data storage and data query method. As shown in FIG. 5, the method includes:
  • Step 501 The cloud storage device performs initial configuration on the cloud storage rule.
  • the cloud storage device receives an initial configuration of the cloud storage rule by the administrator, including defining a basic rule of the storage, and defining a data life cycle.
  • (1) Define a service name; for example, a detailed-list service, a billing service, an electronic document service, and the like.
  • Each piece of data can be classified as production data or backup data: the production data is used for data queries, while the backup data is used to recover the production data and is not used for data queries.
  • For example, the first and second copies can serve as production data, and the third copy as backup data.
  • Alternatively, all three copies of the data can be set as production data, and the cloud storage scheduler evenly distributes requests among the three copies.
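  • One simple way to spread query requests evenly over production copies is round-robin selection. The text does not say which policy the cloud storage scheduler uses, so round-robin is an assumption here, and the class and copy names are hypothetical.

```python
from itertools import cycle


class CopyScheduler:
    """Round-robin selection among production copies (assumed policy)."""

    def __init__(self, production_copies):
        self._next = cycle(production_copies)

    def pick(self):
        """Return the copy that should serve the next query request."""
        return next(self._next)


scheduler = CopyScheduler(["copy-1", "copy-2", "copy-3"])
print([scheduler.pick() for _ in range(6)])
# ['copy-1', 'copy-2', 'copy-3', 'copy-1', 'copy-2', 'copy-3']
```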
  • the life cycle refers to the storage strategy for defining each type of data according to time.
  • Storage policies can include: no compression storage, compressed storage, and deletion.
  • Compressed storage can use different compression algorithms: low-density compression, for data that requires high query efficiency, with a compression ratio of about 2:1; medium-density compression, which balances query efficiency and storage space, with a compression ratio of about 5:1; and high-density compression, for data with lower query-efficiency requirements, with a compression ratio above 8:1.
  • Different storage time ranges can adopt different storage policies. For example, different storage policies can be set for the time when the data is first stored in the database, for the Xth day after storage, and for the Yth month after storage.
  • Managing the database with the data life cycle rules of the service type makes it possible to compress and clear data automatically, which reduces management difficulty and improves database utilization.
  • Production data and backup data can have different data life cycles.
  • For example, production data can be stored uncompressed when it enters the database, compressed at low density after 30 days, and deleted after 90 days.
  • Backup data can be compressed at medium density when it enters the database, compressed at high density after 90 days, and never deleted.
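  • The two example life cycles can be written as a small rule table and looked up by data age. This is a sketch only: the table layout and the name `policy_for` are hypothetical, and treating "after 30 days" as an age of 30 days or more is an assumption about the boundary.

```python
# (age in days from which the policy applies, policy name), in ascending order.
LIFECYCLE = {
    "production": [(0, "no compression"), (30, "low-density compression"), (90, "delete")],
    "backup": [(0, "medium-density compression"), (90, "high-density compression")],
}


def policy_for(kind: str, age_days: int) -> str:
    """Return the storage policy that applies to data of the given age."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    policy = None
    for start_day, name in LIFECYCLE[kind]:
        if age_days >= start_day:  # assumption: the boundary day is inclusive
            policy = name
    return policy


for age in (0, 45, 120):
    print(age, policy_for("production", age), "|", policy_for("backup", age))
# 0 no compression | medium-density compression
# 45 low-density compression | medium-density compression
# 120 delete | high-density compression
```

  • A periodic job could call `policy_for` for each stored batch and apply the returned action, which is how compression and clearing would happen without manual intervention.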
  • Define sub-service types; for example, the detailed-list service can be divided into GSM (Global System for Mobile Communications) voice detailed lists, SMS detailed lists, and so on.
  • A sub-service type is equivalent to a table in the cloud storage.
  • Information such as the number of copies of a sub-service type inherits the settings of the service type.
  • A sub-service type can also set its own number of saved copies and the data life cycle of each copy.
  • The main settings include: the name of the information; the resolution position of the information; the type of the information, such as integer, decimal, string, or large text (for storing images, files, and the like); whether the information is the data-distribution field, that is, whether the entire data set is uniformly distributed according to this information; whether the information is time-type data, in which case the life cycle of the data can be defined according to this field; and, when the information is time-type data, the time format, for example YYYY-MM-DD HH24:MI:SS.
  • Information name: mobile phone number (such as 13606401754); resolution position: 1; type: string (STRING), that is, processed as a string; data-distribution type: yes, for example, the GSM detailed list is evenly distributed according to mobile phone number; time-type data: no; no time format is set.
  • The format of the GSM voice bill is:
  • Information name: call start time; resolution position: 4; type: string (STRING), that is, processed as a string; data-distribution type: yes, for example, the GSM detailed list is evenly distributed according to mobile phone number; time-type data: yes; time format: YYYY-MM-DD HH24:MI:SS, such as 2011-12-31 09:30:00.
  • Information name: CDR type; resolution position: 2; type: string (STRING), that is, processed as a string; data-distribution type: no; time-type data: no; no time format is set.
  • Information name: the other party number (such as 053188163000); resolution position: 3; type: string (STRING), that is, processed as a string; data-distribution type: no; time-type data: no; no time format is set.
  • Information name: the duration of the call (such as 51 seconds); resolution position: 5; type: string (STRING), that is, processed as a string; data-distribution type: no; time-type data: no; no time format is set.
  • Information name: call cost (such as 0.20 yuan); resolution position: 6; type: decimal; data-distribution type: no; time-type data: no; no time format is set.
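  • The field settings above amount to a per-table schema that drives record parsing. The sketch below is illustrative: the `|` delimiter, the CDR type value `01`, and all identifiers are assumptions (the actual bill format is not reproduced in the text), the time format is the Python equivalent of YYYY-MM-DD HH24:MI:SS, and the data-distribution flag is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Field:
    name: str
    position: int                      # 1-based resolution position in the record
    kind: str                          # "STRING" or "DECIMAL"
    time_format: Optional[str] = None  # strptime format when the field is time-type


GSM_VOICE_FIELDS = [
    Field("mobile_number", 1, "STRING"),
    Field("cdr_type", 2, "STRING"),
    Field("other_party_number", 3, "STRING"),
    Field("call_start_time", 4, "STRING", time_format="%Y-%m-%d %H:%M:%S"),
    Field("call_duration", 5, "STRING"),
    Field("call_cost", 6, "DECIMAL"),
]


def parse_record(line: str, fields=GSM_VOICE_FIELDS, sep: str = "|") -> dict:
    """Split one text record and convert each field according to its settings."""
    parts = line.rstrip("\n").split(sep)
    row = {}
    for f in fields:
        raw = parts[f.position - 1]
        if f.time_format:
            row[f.name] = datetime.strptime(raw, f.time_format)
        elif f.kind == "DECIMAL":
            row[f.name] = Decimal(raw)
        else:
            row[f.name] = raw
    return row


# Hypothetical record line built from the example values in the text.
row = parse_record("13606400001|01|053188163000|2011-12-31 09:30:00|51|0.20")
print(row["call_start_time"], row["call_cost"])  # 2011-12-31 09:30:00 0.20
```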
  • Step 502 The cloud storage device performs initial configuration on a data extraction rule, and initially configures a data upload rule.
  • the cloud storage device receives an initial configuration of the data extraction and data uploading rules by the administrator.
  • The data extraction rule includes the following: (1) the external data source used for extracting data, and the connection mode with the data source; (2) the number of data extraction processes, and the data range corresponding to each data extraction process, for example, data extraction by region, mobile number segment, number mantissa, and so on; (3) the size of the extracted files, for example, the first file is 10M, and a threshold on the number of extracted numbers, for example, at most 100 phone numbers; (4) the file storage path after data extraction.
  • The data uploading rule includes the following: the number of data uploading processes, and the data range corresponding to each data uploading process.
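  • As an illustration of assigning a data range to each process, the sketch below splits numbers by mantissa (last digit) across a configurable number of extraction processes. The modulo assignment and the function names are assumptions; the text only states that ranges may be defined by region, number segment, or number mantissa.

```python
def mantissa_ranges(process_count: int) -> list:
    """Split the last digits 0-9 across extraction processes as evenly as possible."""
    if not 1 <= process_count <= 10:
        raise ValueError("process_count must be between 1 and 10")
    return [[d for d in range(10) if d % process_count == p]
            for p in range(process_count)]


def process_for(number: str, process_count: int) -> int:
    """Index of the extraction process responsible for a given number."""
    return int(number[-1]) % process_count


print(mantissa_ranges(3))             # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(process_for("13606401754", 3))  # 1
```

  • The same scheme could be reused for the uploading processes, since the uploading rule has the same shape (a process count plus a range per process).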
  • Steps 501 and 502 are preparations for implementing the embodiment of the present invention.
  • The execution order of steps 501 and 502 is not strictly fixed: step 501 may be performed first, or step 502 may be performed first.
  • Step 503 The cloud storage device extracts data in the external data source according to the configured data extraction rule, obtains the first data, or performs format conversion on the data in the external data source to obtain the second data.
  • The external data source can be a data source saved in the terminal.
  • Alternatively, the external data source can convert the CDR format into a format that the cloud storage can recognize, so that the cloud storage device directly receives data in a recognizable format. In this case, data is not extracted from the external data source; instead, the second data obtained after format conversion is received, and the received data is then uploaded and imported into the distributed database of the cloud storage.
  • The extracted data is in text file format, for example, a bill file, in the format:
  • each field is defined as:
  • the other party number for example, 053188163000;
  • the length of the call (in seconds), for example, 51 seconds;
  • The first call detail record means that the owner of mobile phone 13606400001 dialed the number 053188163000 at 2011-12-31 09:30:00; the call lasted 51 seconds, and the call charge was 0.2 yuan.
  • Step 504 The cloud storage device acquires a transit path of the cloud storage according to the management node of the cloud storage.
  • The data uploading process connects to the management node of the cloud storage and invokes the uploading directory service provided by the cloud storage, where the uploading directory service takes the following parameters: service type, sub-service type, and data features (for example, the Jinan area with area code 531, and the date 20111201).
  • The cloud storage management node determines the file directory that the data uploading process can use according to the service settings and the busyness of each cloud storage node, organizes the file directory into URL (Uniform Resource Locator) format, and returns it to the data uploading process.
  • The directory in URL format is the transit zone path to be obtained. For example: ftp://192.168.1.1/CDR/gsm_cdr/531/20111201/
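  • How a management node might assemble such a transit path can be sketched as follows. The load table, the least-busy selection rule, and the function name `transit_path` are assumptions for illustration; the text states only that the directory depends on the service settings and on how busy each node is.

```python
from urllib.parse import urlunsplit


def transit_path(node_load: dict, service: str, sub_service: str,
                 area_code: str, date: str) -> str:
    """Pick the least busy node and return its upload directory as an FTP URL."""
    host = min(node_load, key=node_load.get)  # assumed rule: lowest load wins
    path = "/" + "/".join([service, sub_service, area_code, date]) + "/"
    return urlunsplit(("ftp", host, path, "", ""))


# Hypothetical load figures for two cloud storage nodes.
load = {"192.168.1.1": 0.20, "192.168.1.2": 0.75}
print(transit_path(load, "CDR", "gsm_cdr", "531", "20111201"))
# ftp://192.168.1.1/CDR/gsm_cdr/531/20111201/
```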
  • Step 505: The cloud storage device saves the first data or the second data to a corresponding directory of a temporary file transfer area of the cloud storage according to the configured data uploading rule and the transit area path, where each file in each directory in the temporary file transfer area is saved in text file format.
  • According to the type of the specific data, the data is saved to the corresponding directory in the temporary file transfer area of the cloud storage; for example, GSM-format data is saved in files under "/CDR/gsm_cdr/", and GPRS-format data is saved in files under "/CDR/gprs/".
  • The file transfer area ensures that the data is completely uploaded to the second directory of the cloud storage before it is imported into the distributed database of the cloud storage, which improves the integrity of the transmitted data.
  • Step 506 The cloud storage device determines, according to the corresponding directory of the temporary file transfer area, and the configured rules of the directory and the sub-service type in the cloud storage rule, the sub-service type of the data to be saved;
  • the data import process of the cloud storage saves the transit zone temporary file to the distributed database of the cloud storage.
  • The data import process determines the sub-service type according to the corresponding directory of the transit zone where the data to be imported is located. Specifically, the table name corresponding to the files in each directory is obtained from the predefined "directory and sub-service type" rule in the cloud storage rule initially configured in step 501, and multiple data import processes then scan multiple directories in parallel, thereby determining the sub-service type under the second directory.
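  • The "directory and sub-service type" lookup can be sketched as a mapping from transit-zone directories to table names. The two directories follow the examples in the text; the table names and the function name are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed "directory and sub-service type" rule: directory -> table name.
DIRECTORY_RULES = {
    "/CDR/gsm_cdr": "gsm_cdr",
    "/CDR/gprs": "gprs",
}


def sub_service_type(file_path: str):
    """Return the table name for a transit-zone file, or None if no rule matches."""
    for parent in PurePosixPath(file_path).parents:  # nearest directory first
        table = DIRECTORY_RULES.get(str(parent))
        if table is not None:
            return table
    return None


print(sub_service_type("/CDR/gsm_cdr/531/20111201/part-0001.txt"))  # gsm_cdr
print(sub_service_type("/CDR/gprs/531/20111201/part-0001.txt"))     # gprs
```

  • Because the lookup depends only on the path, several import processes can scan different directories concurrently without coordinating with each other.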
  • Step 507: The cloud storage device distributes the to-be-saved data to the data nodes of each cloud storage according to the configured import rule of the file corresponding to the sub-service type in the cloud storage rule, and stores the to-be-saved data in the distributed database of the cloud storage;
  • The cloud storage device distributes the data to be saved to each cloud storage data node according to a hash algorithm; the cloud storage device simultaneously stores the different data to be saved on the cloud storage data nodes into the distributed database of the cloud storage, or simultaneously stores the fragments split from the same data to be saved on the cloud storage data nodes into the distributed database of the cloud storage.
  • The files in the temporary file transfer area are imported into the distributed database of the cloud storage. During import, according to the data distribution rule defined in the import rule, the data to be saved in the temporary file transfer area is automatically distributed to each cloud storage node by a hash algorithm.
  • For example, the data can be uniformly distributed by mobile phone number: the records of mobile phone number 1 in the data to be saved in the temporary file transfer area are distributed to node A, and the records of mobile phone number 2 are distributed to node B.
  • The cloud storage imports data in parallel automatically, and the degree of parallelism increases automatically as cloud storage nodes are added.
  • For example, when there are three cloud storage nodes, the degree of parallelism is 3, and when there are four cloud storage nodes, the degree of parallelism is 4.
  • The traditional import method does not import into the database in parallel by default. Although parallel import can be specified manually, the degree of parallelism does not grow as hardware capacity is added; compared with the traditional data import method, the parallel import here therefore offers better performance.
  • When the data to be saved is imported into the distributed database of the cloud storage, the data can be saved in multiple copies according to the configured cloud storage rule. For example, three copies of the data can be saved for the detailed-list service: the first and second copies are production data, used for queries and not compressed, and the third copy is backup data, compressed at medium density.
  • Step 508 The cloud storage device performs different processing on different periods of the data to be saved in the distributed database of the cloud storage according to the configured rules of the data life cycle in the cloud storage rule.
  • For example, for some data, low-density compression can be performed after 30 days and the data deleted after 90 days.
  • For other data, medium-density compression is performed when the data is stored in the distributed database of the cloud storage, high-density compression is performed after 90 days, and the data is never deleted.
  • Automatically compressing and clearing the database according to the data lifecycle rules can improve the storage utilization of the database, reduce the workload of maintenance personnel, and reduce management difficulty.
  • Step 509 The cloud storage device acquires an index field input by a user, and generates a query instruction according to the index field.
  • For example, the index fields can be the mobile phone number and the month to be queried.
  • The generated query instruction then includes the mobile phone number and the query month.
  • The index field input by the user is received through the query interface.
  • Step 510 The cloud storage device sends the query instruction to each cloud storage data node, and queries the data in a distributed database of the cloud storage in parallel;
  • the cloud storage device sends the query instruction to the data nodes of the cloud storage at the same time; the cloud storage device then simultaneously queries the distributed databases carried on those data nodes for data that satisfies the query instruction.
  • Step 511 The cloud storage device sends the set of query results of the cloud storage nodes to the user.
  • the cloud storage device sorts the query results of the cloud storage nodes according to a user-defined rule, and sends the sorted query result set to the user; or
  • the cloud storage device sorts the query results of the cloud storage nodes in the order of the nodes, and sends the sorted query result sets to the user; or
  • the cloud storage device sorts the query results of the cloud storage nodes by a keyword in the query results, and sends the sorted query result set to the user.
  • a cloud storage device acquires data to be saved in an external data source through a data extraction process.
  • The data upload process takes the temporary files and uploads them to the temporary file transfer area, where they wait to be imported into the distributed database of the cloud storage; the data import process takes the files in the temporary file transfer area and imports them into the distributed database in parallel.
  • When the data in the distributed database is managed at a later stage, it is compressed and cleared according to the data lifecycle rules.
  • When the user needs to query data in the distributed database of the cloud storage, the user can query directly through the query interface.
  • The data storage and data query method provided by the embodiment of the invention provides parallel storage and parallel data query, which can improve the speed of storing and retrieving data and reduce management difficulty.
  • An embodiment of the present invention provides a data storage device, where the device may be a cloud storage device.
  • the device includes: an obtaining module 601, a data acquiring unit 6011, a data uploading unit 6012, a storage module 602, a distribution unit 6021, a storage unit 6022, an initial configuration module 603, a determining module 604, and a management module 605;
  • the obtaining module 601 is configured to obtain data to be saved;
  • the storage module 602 is connected to the acquisition module 601, and is configured to evenly distribute the data to be saved to the data nodes of each cloud storage, and store the data to be saved in parallel to the distributed database of the cloud storage.
  • the device further includes an initial configuration module 603, configured to initially configure the cloud storage rules, including defining the rule relating directories to sub-service types, defining the import rule for the files corresponding to each sub-service type, and defining the data lifecycle.
  • The lifecycle refers to a storage policy defined over time for each type of data. The initial configuration module 603 also initially configures the data extraction rules, including the data source from which data is extracted, the number of extraction processes, and the data range corresponding to each extraction process;
  • and initially configures the data upload rules, including the number of upload processes and the data range corresponding to each upload process.
  • the obtaining module 601 includes: a data acquiring unit 6011, a data uploading unit 6012;
  • the data obtaining unit 6011 is configured to extract, according to the configured data extraction rule, the data in the external data source to obtain the first data, or perform format conversion on the data in the external data source to obtain the second data.
  • the data uploading unit 6012 is connected to the data acquiring unit 6011, and is configured to obtain the transit zone path of the cloud storage from the management node of the cloud storage; and, according to the configured data upload rule and the transit zone path, to save the first data or the second data in the corresponding directory of the temporary file transfer area of the cloud storage, wherein each file in each directory in the temporary file transfer area is saved in text file format.
  • the device further includes a determining module 604; the determining module 604 is connected to the storage module 602 and is configured to determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the configured rule relating directories to sub-service types in the cloud storage rules; the storage module 602 is configured to distribute the data to be saved to the cloud storage data nodes according to the configured import rule for the files corresponding to the sub-service type in the cloud storage rules, and to store the data to be saved in parallel in the distributed database of the cloud storage.
  • the distribution unit 6021 in the storage module 602 is configured to evenly distribute the data to be saved to the cloud storage data nodes according to a hash algorithm;
  • the storage unit 6022 in the storage module 602 is configured to simultaneously store the different data to be saved on the cloud storage data nodes in the distributed database of the cloud storage, or to simultaneously store the fragments obtained by splitting the same data to be saved across the cloud storage data nodes in the distributed database of the cloud storage.
  • As the number of cloud storage data nodes increases, the degree of parallelism of the parallel storage increases automatically.
  • the storage module 602 is further configured to save, according to the configured cloud storage rules, data for different uses in the distributed database in different numbers of copies, where the data for different uses includes production data and backup data, and the production data is used for queries.
  • the device further includes: a management module 605, configured to process the data to be saved in the distributed database of the cloud storage differently in different periods, according to the configured data lifecycle rules in the cloud storage rules.
  • For example, three copies of the data can be saved: the first and second copies are production data, used for queries and left uncompressed, and the third is backup data, stored with medium-density compression.
  • An embodiment of the present invention provides a data storage device, which acquires data to be saved by using an obtaining module.
  • The storage module evenly distributes the data to be saved to the cloud storage data nodes and stores it in parallel in the distributed database of the cloud storage, which can increase the speed of data storage.
  • An embodiment of the present invention provides a device for querying data.
  • the device may be a cloud storage device.
  • the device includes: an obtaining module 701, a processing module 702, and a sending unit 7021.
  • the obtaining module 701 is configured to obtain an index field input by the user, and generate a query instruction according to the index field;
  • the processing module 702 is configured to send the query instruction to the data nodes of each cloud storage, query data in a distributed database of the cloud storage in parallel, and send the set of the query results of the cloud storage nodes to the User.
  • The saved data can be divided into production data and backup data.
  • Queries are run only against the production data.
  • If the production data is damaged, the backup data can be used to recover it.
  • A query does not need to be concerned with where the data is stored or whether it is compressed.
  • the sending unit 7021 in the processing module 702 is configured to send the query instruction to the data nodes of each cloud storage at the same time.
  • the processing unit 7022 in the processing module 702 is configured to simultaneously query the distributed databases carried on the cloud storage data nodes for data that satisfies the query instruction.
  • processing module 702 is configured to:
  • sort the query results of the cloud storage nodes by a keyword in the query results, and send the sorted query result set to the user.
  • the embodiment of the invention provides a device for querying data.
  • The processing module uses the query instruction generated by the obtaining module to query data in the database in parallel, so that query performance can be greatly improved.
  • The devices shown in FIG. 6 and FIG. 7 may be the same device; that is, the cloud storage device can perform both the data storage and the data query functions.
  • the embodiment of the present invention provides a data storage system, as shown in FIG. 8, including a terminal 801 and a cloud storage device 802;
  • the terminal 801 is configured to extract data from the data source according to the configured data extraction rule to obtain the first data, and to save the first data in a temporary folder, so that the cloud storage device uploads the first data in the temporary folder, according to the data upload rule and the obtained transit zone path, to the corresponding directory of the temporary file transfer area of the cloud storage device;
  • the cloud storage device 802 is configured to upload the first data in the temporary folder in the terminal to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rule and the transit zone path, to evenly distribute the first data in that directory to the cloud storage data nodes, and to store the data to be saved in parallel in the distributed database of the cloud storage.
  • the data source is data stored in the terminal 801.
  • Each data extraction process extracts data according to the configured data extraction rules and saves the extracted data to the directory corresponding to that process. Each directory contains multiple temporary files; for example, the extracted first data is saved as a first file, and the first file is saved in the first directory corresponding to the data extraction process.
  • For example, the corresponding directory is /531/gsm_cdr/201112/01/1360640.
  • A new file is generated when the size of the temporary file or the number of records saved in it reaches the threshold configured in the data extraction rules.
  • For example, GSM_531_20111201_1360640.0020 denotes the CDR file for number segment 1360640 in Jinan City for December 2011, with serial number 0020.
  • When the threshold is reached, the file GSM_531_20111201_1360640.0021 is generated.
  • the cloud storage device 802 can be the device for data storage as described in FIG.
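
The temporary file naming and rollover described above can be sketched in Python. This is only an illustration: the name layout `<service>_<area code>_<date>_<number segment>.<serial>` is inferred from the single example GSM_531_20111201_1360640.0020, the `_cdr` directory suffix is inferred from the example directory /531/gsm_cdr/201112/01/1360640, and the helper names `parse_temp_file_name`, `next_temp_file_name`, and `directory_for` are hypothetical.

```python
import re
from typing import NamedTuple

# Layout inferred from the example "GSM_531_20111201_1360640.0020" (an assumption).
_NAME_RE = re.compile(r"^([A-Za-z]+)_(\d+)_(\d{8})_(\d+)\.(\d{4})$")


class TempFileName(NamedTuple):
    service: str         # e.g. "GSM"
    area_code: str       # e.g. "531"
    date: str            # e.g. "20111201" (YYYYMMDD)
    number_segment: str  # e.g. "1360640"
    serial: int          # e.g. 20


def parse_temp_file_name(name: str) -> TempFileName:
    """Split a temporary file name into its fields."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected temporary file name: {name!r}")
    service, area, date, segment, serial = match.groups()
    return TempFileName(service, area, date, segment, int(serial))


def next_temp_file_name(name: str) -> str:
    """Name of the file opened once the size or record-count threshold is reached."""
    f = parse_temp_file_name(name)
    return f"{f.service}_{f.area_code}_{f.date}_{f.number_segment}.{f.serial + 1:04d}"


def directory_for(name: str) -> str:
    """Directory in the style of the example /531/gsm_cdr/201112/01/1360640."""
    f = parse_temp_file_name(name)
    return (f"/{f.area_code}/{f.service.lower()}_cdr/"
            f"{f.date[:6]}/{f.date[6:]}/{f.number_segment}")
```

With these assumptions, rolling over from serial 0020 yields the 0021 file named in the text, and the directory is derived from the same fields as the file name.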

Abstract

Disclosed are a method and apparatus for data storage and query, which relate to the technical field of communication networks and improve the speed of data storage and query. In the solution provided by the embodiments of the present invention, a cloud storage device obtains the data to be stored, distributes the data evenly to the cloud storage data nodes, and stores the data in the cloud storage distributed database in parallel. The cloud storage device also obtains the index fields entered by a user and generates query commands according to the index fields; sends the query commands to each cloud storage data node and queries data in the cloud storage distributed database in parallel; and sends the set of query results from the cloud storage data nodes to the user. The solution provided by the embodiments of the present invention is suitable for storing and querying data.

Description

Method and Device for Data Storage and Data Query
Technical Field
The present invention relates to the field of communication network technologies, and in particular to a method and device for data storage and data query.
Background
Cloud computing is a product of the development of distributed processing, parallel processing, and grid computing. Cloud storage is an extension and development of cloud computing: it is a system in which a large number of storage devices in a network are brought together by software, through cluster applications, grid technology, distributed file systems, distributed databases, and the like, to work cooperatively and jointly provide data storage and service access functions.
At present, relational databases store data in rows and columns. Taking a call detail record (CDR) table in an Oracle database as an example, each CDR generally exists as one row in the database table, and each row contains multiple fields such as the number, the other party's number, the call time, and the call duration. At the lowest level, data is stored in data blocks (Oracle data blocks). A data block is Oracle's smallest storage unit and occupies a fixed amount of disk space (for example, a 16k block); that is, every Oracle I/O (input/output) operation is performed in units of blocks. For example, although a single CDR is only 100 bytes, a query must read at least one block of data. If the CDR spans two data blocks, two blocks must be read.
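The block-granularity cost described above can be made concrete with a little arithmetic. The following Python sketch is illustrative only; it assumes that "16k" means 16 × 1024 bytes and models a record as a contiguous byte range, which is a simplification and not a description of Oracle's actual block layout. The function names are hypothetical.

```python
BLOCK_SIZE = 16 * 1024  # the "16k" block from the text, taken here as 16 KiB (assumption)


def blocks_read(offset: int, length: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks touched by a record of `length` bytes starting at `offset`."""
    if offset < 0 or length <= 0 or block_size <= 0:
        raise ValueError("offset must be >= 0; length and block_size must be > 0")
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return last_block - first_block + 1


def bytes_read(offset: int, length: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes actually transferred, since I/O is done in whole blocks."""
    return blocks_read(offset, length, block_size) * block_size
```

Under these assumptions, a 100-byte record that sits inside one block costs one block read (16384 bytes transferred), and the same record straddling a block boundary costs two.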
A file system can also be used to store and query data. For example, itemized call records and billing data are stored as files in a file system. The file system can classify data by region, time (for example, billing period), number, and so on, and store the structured records directly in files as text or in other forms. File systems typically use a time-based directory structure: for example, directories are created by time (billing period) and subscriber number segment, and record files are created per number. When data needs to be queried, a simple index can be built from the directory hierarchy, file names, and the like. Querying requires searching the file system's massive metadata, reading in all of the stored files, decompressing them, and performing the data retrieval at the application layer.
However, when the prior art is used for mass data storage and data query, storage and query are slow.
Summary
Embodiments of the present invention provide a method and device for data storage and data query, which can improve the speed of storing and retrieving data.
The embodiments of the present invention adopt the following technical solutions:
A method for data storage, including:
acquiring, by a cloud storage device, data to be saved;
evenly distributing, by the cloud storage device, the data to be saved to the cloud storage data nodes, and storing the data to be saved in parallel in the distributed database of the cloud storage.
A method for data query, including:
acquiring, by a cloud storage device, an index field input by a user, and generating a query instruction according to the index field;
sending, by the cloud storage device, the query instruction to the data nodes of the cloud storage, and querying data in the distributed database of the cloud storage in parallel;
sending, by the cloud storage device, the set of query results of the cloud storage nodes to the user.
A device for data storage, including:
an obtaining module, configured to obtain data to be saved;
a storage module, configured to evenly distribute the data to be saved to the cloud storage data nodes, and to store the data to be saved in parallel in the distributed database of the cloud storage.
A device for data query, including:
an obtaining module, configured to obtain an index field input by a user and generate a query instruction according to the index field;
a processing module, configured to send the query instruction to the data nodes of the cloud storage, query data in the distributed database of the cloud storage in parallel, and send the set of query results of the cloud storage nodes to the user.
A data storage system, including a terminal and a cloud storage device;
the terminal is configured to extract data from a data source according to configured data extraction rules to obtain first data, and to save the first data in a temporary folder, so that the cloud storage device uploads the first data in the temporary folder, according to data upload rules and an obtained transit zone path, to the corresponding directory of the temporary file transfer area of the cloud storage device;
the cloud storage device is configured to upload the first data in the temporary folder in the terminal to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rules and the transit zone path, to evenly distribute the first data in that directory to the cloud storage data nodes, and to store the data to be saved in parallel in the distributed database of the cloud storage.
Embodiments of the present invention provide a method and device for data storage and data query. A cloud storage device acquires data to be saved, evenly distributes the data to be saved to the cloud storage data nodes, and stores it in parallel in the distributed database of the cloud storage. The cloud storage device also acquires an index field input by a user and generates a query instruction according to the index field; sends the query instruction to the data nodes of the cloud storage and queries data in the distributed database in parallel; and sends the set of query results of the cloud storage nodes to the user. In the prior art, when a relational database is used, data must be accessed in units of blocks, which makes storage and query slow; when a file system is used, the operations are pure file operations, so data cannot be queried by specified conditions, management is difficult, and retrieval requires reading and decompressing all files, which makes retrieval slow. By comparison, the solution provided by the embodiments of the present invention offers parallel storage and parallel data query, and can improve the speed of storing and retrieving data.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for data storage according to Embodiment 1 of the present invention;
FIG. 2 is a flowchart of a method for data query according to Embodiment 1 of the present invention;
FIG. 3 is a block diagram of a device for data storage according to Embodiment 1 of the present invention;
FIG. 4 is a block diagram of a device for data query according to Embodiment 1 of the present invention;
FIG. 5A is a flowchart of a method for data storage and data query according to Embodiment 2 of the present invention;
FIG. 5B is a schematic diagram of a method for data storage and data query according to Embodiment 2 of the present invention;
FIG. 6 is a block diagram of a device for data storage according to Embodiment 2 of the present invention;
FIG. 7 is a block diagram of a device for data query according to Embodiment 2 of the present invention;
FIG. 8 is a schematic diagram of a system for data storage according to Embodiment 2 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
An embodiment of the present invention provides a method for data storage. As shown in FIG. 1, the method includes:
Step 101: A cloud storage device acquires data to be saved.
Before this step, the method further includes: initially configuring the cloud storage rules, including defining the rule relating directories to sub-service types, defining the import rule for the files corresponding to each sub-service type, and defining the data lifecycle, where the lifecycle refers to a storage policy defined over time for each type of data; initially configuring the data extraction rules, including the data source from which data is extracted, the number of extraction processes, and the data range corresponding to each extraction process; and initially configuring the data upload rules, including the number of upload processes and the data range corresponding to each upload process.
Optionally, data in an external data source is extracted according to the configured data extraction rules to obtain first data, or data in the external data source is format-converted to obtain second data;
a transit zone path of the cloud storage is obtained from the management node of the cloud storage;
according to the configured data upload rules and the transit zone path, the first data or the second data is saved in the corresponding directory of the temporary file transfer area of the cloud storage, wherein each file in each directory of the temporary file transfer area is saved in text file format.
Step 102: The cloud storage device evenly distributes the data to be saved to the cloud storage data nodes, and stores the data to be saved in parallel in the distributed database of the cloud storage.
Optionally, the cloud storage device evenly distributes the data to be saved to the cloud storage data nodes according to a hash algorithm;
the cloud storage device simultaneously stores the different data to be saved on the cloud storage data nodes in the distributed database of the cloud storage, or simultaneously stores the fragments obtained by splitting the same data to be saved across the cloud storage data nodes in the distributed database of the cloud storage.
Optionally, when the cloud storage device evenly distributes the data to be saved to the cloud storage data nodes and stores it in parallel in the distributed database of the cloud storage, the degree of parallelism of the parallel storage increases automatically as cloud storage data nodes are added.
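The hash-based distribution and node-count-driven parallelism described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not name a specific hash algorithm, so SHA-256 modulo the node count is an assumption, data nodes are modeled as in-memory lists, the `phone` field is a hypothetical distribution key, and `node_for`, `distribute`, and `store_in_parallel` are hypothetical names.

```python
import hashlib
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def node_for(key: str, node_count: int) -> int:
    """Map a distribution key (e.g. a phone number) to a node index.

    SHA-256 modulo node_count is an assumption; the source only says "hash algorithm".
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribute(records, node_count):
    """Group records by the node their key hashes to."""
    buckets = defaultdict(list)
    for record in records:
        buckets[node_for(record["phone"], node_count)].append(record)
    return buckets


def store_in_parallel(records, nodes):
    """Write each node's share concurrently; returns the degree of parallelism used.

    `nodes` is a list of lists standing in for cloud storage data nodes.
    """
    if not nodes:
        raise ValueError("at least one data node is required")
    buckets = distribute(records, len(nodes))
    parallelism = len(nodes)  # one writer per node, so it grows as nodes are added
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(nodes[i].extend, buckets.get(i, []))
                   for i in range(len(nodes))]
        for future in futures:
            future.result()  # re-raise any storage error
    return parallelism
```

Because the degree of parallelism is taken from `len(nodes)`, three nodes give a parallelism of 3 and four nodes give 4, matching the behavior described in the text. One caveat of this simple modulo scheme is that changing the node count remaps most keys; the source does not say how redistribution is handled.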
Optionally, before the cloud storage device evenly distributes the data to be saved to the cloud storage data nodes and stores it in parallel in the distributed database of the cloud storage, the method further includes:
determining the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the configured rule relating directories to sub-service types in the cloud storage rules;
according to the configured import rule for the files corresponding to the sub-service type in the cloud storage rules, evenly distributing the data to be saved to the cloud storage nodes, and storing the data to be saved in the first file in parallel in the database of the cloud storage.
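The directory-to-sub-service lookup described above could look like the following Python sketch. Everything in the rule table is hypothetical: the patent does not give a rule format, so the `DIRECTORY_RULES` layout, the table name, and the `distribution_key` field are assumptions. The directory used for illustration is the example path /531/gsm_cdr/201112/01/1360640 that appears elsewhere in this document.

```python
from pathlib import PurePosixPath

# Hypothetical "directory and sub-service type" rules; the real rule format is not specified.
DIRECTORY_RULES = {
    "gsm_cdr": {
        "sub_service": "gsm_cdr",     # sub-service type implied by the directory name
        "table": "gsm_cdr",           # assumed target table name
        "distribution_key": "phone",  # assumed field used for hash distribution on import
    },
}


def rule_for_directory(directory: str) -> dict:
    """Return the rule whose directory name appears as a component of `directory`."""
    for part in PurePosixPath(directory).parts:
        rule = DIRECTORY_RULES.get(part)
        if rule is not None:
            return rule
    raise KeyError(f"no sub-service rule matches directory {directory!r}")
```

An import process would call `rule_for_directory` once per scanned directory and then apply the matching import rule to every file found there.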
Further, according to the configured data lifecycle rules in the cloud storage rules, the data to be saved in the distributed database of the cloud storage is processed differently in different periods.
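A data lifecycle rule of this kind can be sketched as a small lookup table. The sketch below is an assumption-laden illustration: the 30-day and 90-day thresholds are the example values given in the Embodiment 2 material in this document, while the type names `type_a` and `type_b`, the action strings, and the reading of "after N days" as "age of at least N days" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LifecycleRule:
    initial_action: str                  # what is done when the data is first stored
    stages: Tuple[Tuple[int, str], ...]  # (minimum age in days, action), ascending by age


# Hypothetical rule set using the example thresholds from this document.
RULES = {
    # no compression at first, low-density compression after 30 days, delete after 90 days
    "type_a": LifecycleRule("none", ((30, "compress_low"), (90, "delete"))),
    # medium-density compression on import, high-density after 90 days, never deleted
    "type_b": LifecycleRule("compress_medium", ((90, "compress_high"),)),
}


def action_for(data_type: str, age_days: int) -> str:
    """Return the lifecycle action that applies to data of the given age."""
    rule = RULES[data_type]
    action = rule.initial_action
    for min_age, stage_action in rule.stages:
        if age_days >= min_age:  # assumption: "after N days" means age >= N
            action = stage_action
    return action
```

A periodic maintenance job could evaluate `action_for` for each stored partition and apply the result, which is one way to obtain the automatic compression and clearing the text describes.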
In addition, according to the configured cloud storage rules, data for different uses stored in the distributed database is saved in different numbers of copies, where the data for different uses includes production data and backup data, and the production data is used for queries.
An embodiment of the present invention provides a method for data storage in which the data to be saved is evenly distributed to the cloud storage data nodes and stored in parallel in the distributed database of the cloud storage. Because the data records in the database are stored in a distributed manner, data can be saved quickly.
An embodiment of the present invention provides a method for data query. As shown in FIG. 2, the method includes:
Step 201: The cloud storage device acquires an index field input by a user, and generates a query instruction according to the index field.
For example, the user can enter a mobile phone number and the month of the itemized records to be queried; a query instruction is generated from the mobile phone number and the month, and the subsequent query operations are performed.
Further, the index field input by the user is received through a query interface.
Step 202: The cloud storage device sends the query instruction to the data nodes of the cloud storage, and queries data in the distributed database of the cloud storage in parallel.
The saved data can be divided into production data and backup data. Queries are run only against the production data; if the production data is damaged, the backup data can be used to restore it. A query does not need to be concerned with where the data is stored or whether it is compressed.
The cloud storage device sends the query instruction to the data nodes of the cloud storage at the same time, and simultaneously queries the distributed databases carried on those data nodes for data that satisfies the query instruction.
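Steps 201 and 202 amount to building a query instruction from the index fields and scattering it to every data node at once. The following Python sketch illustrates that pattern under stated assumptions: data nodes are modeled as in-memory lists of records, the record fields `phone` and `month` are hypothetical, and `build_instruction`, `query_node`, and `parallel_query` are illustrative names rather than anything defined in the patent.

```python
from concurrent.futures import ThreadPoolExecutor


def build_instruction(phone: str, month: str) -> dict:
    """Step 201: turn the user's index fields into a query instruction."""
    return {"phone": phone, "month": month}


def query_node(node_records, instruction):
    """Run the instruction against one node's local share of the data."""
    return [r for r in node_records
            if r["phone"] == instruction["phone"] and r["month"] == instruction["month"]]


def parallel_query(nodes, instruction):
    """Step 202: send the same instruction to every node and query them concurrently.

    Returns one result list per node, in node order.
    """
    if not nodes:
        return []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(lambda node: query_node(node, instruction), nodes))
```

Keeping the per-node results separate at this stage leaves the choice of how to order the final result set to the step that returns results to the user.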
Step 203: The cloud storage device sends the set of query results from the cloud storage nodes to the user.
Optionally, the cloud storage device sorts the query results of the cloud storage nodes according to a user-defined rule and sends the sorted result set to the user; or,
the cloud storage device sorts the query results of the cloud storage nodes by node order and sends the sorted result set to the user; or,
the cloud storage device sorts the query results of the cloud storage nodes by a keyword in the query results and sends the sorted result set to the user.
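Steps 201 to 203 can be sketched as a scatter-gather query. The sketch below is illustrative only: `QueryInstruction`, `NODES`, `query_node` and `parallel_query` are hypothetical names, the nodes are in-memory lists rather than real data nodes, and one subscriber's records are placed on two nodes purely to make the merge visible.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryInstruction:
    """Hypothetical query instruction built from the user's index fields."""
    phone_number: str
    month: str  # for example "2011-12"


# Hypothetical stand-in for the records held by each data node.
# Each record is (phone number, call start time, duration in seconds).
NODES = {
    "node_a": [("13606400001", "2011-12-31 09:35:10", 65),
               ("13606400001", "2011-11-02 10:00:00", 12)],
    "node_b": [("13606400001", "2011-12-31 09:30:00", 51),
               ("13606401754", "2011-12-31 09:30:00", 51)],
}


def query_node(node_name, instruction):
    """Run the query on one node and return its matching records."""
    return [r for r in NODES[node_name]
            if r[0] == instruction.phone_number
            and r[1].startswith(instruction.month)]


def parallel_query(instruction, sort_key=None):
    """Send the instruction to every node at once and merge the results."""
    names = sorted(NODES)  # fixed node order
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # map() yields results in the order of `names`, i.e. node order.
        per_node = list(pool.map(lambda n: query_node(n, instruction), names))
    merged = [record for part in per_node for record in part]
    if sort_key is not None:
        merged.sort(key=sort_key)  # user-defined rule or keyword ordering
    return merged


instruction = QueryInstruction("13606400001", "2011-12")
by_node = parallel_query(instruction)                          # node order
by_time = parallel_query(instruction, sort_key=lambda r: r[1])  # keyword order
```

Passing no `sort_key` corresponds to the node-order option; passing one covers both the user-defined rule and the keyword ordering.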
This embodiment of the present invention provides a data query method in which, according to the query instruction, the cloud storage data nodes query the distributed database of the cloud storage in parallel, which can greatly improve query performance.
An embodiment of the present invention provides a data storage apparatus, which may be a cloud storage device. As shown in FIG. 3, the apparatus includes an obtaining module 301 and a storage module 302;
The obtaining module 301 is configured to obtain the data to be saved;
Further, a data obtaining unit in the obtaining module 301 is configured to extract data from an external data source according to the configured data extraction rules to obtain first data, or to obtain second data after format conversion of the data in the external data source;
A data uploading unit in the obtaining module 301 is configured to obtain the transit area path of the cloud storage from the management node of the cloud storage, and, according to the configured data uploading rules and the transit area path, to save the first data or the second data to the corresponding directory of the temporary file transit area of the cloud storage.
Further, the apparatus includes an initial configuration module, configured to: initially configure the cloud storage rules, including defining the rules that relate directories to sub-service types, defining the import rules for the files of each sub-service type, and defining the data life cycle, where the life cycle is the storage policy defined for each class of data according to time; initially configure the data extraction rules, including the data source from which data is extracted, the number of extraction processes, and the data range of each extraction process; and initially configure the data uploading rules, including the number of upload processes and the data range of each upload process.
The storage module 302 is configured to distribute the data to be saved evenly across the cloud storage data nodes and to store the data to be saved in parallel in the distributed database of the cloud storage.
A distribution unit in the storage module 302 is configured to distribute the data to be saved evenly across the cloud storage data nodes according to a hash algorithm;
A storage unit in the storage module 302 is configured to store the different pieces of data to be saved on the cloud storage data nodes simultaneously in the distributed database of the cloud storage, or to split one piece of data to be saved and store the resulting fragments on the cloud storage data nodes simultaneously in the distributed database of the cloud storage.
When the data to be saved is distributed evenly across the cloud storage data nodes and stored in parallel in the distributed database of the cloud storage, the degree of parallelism of the parallel storage increases automatically as cloud storage data nodes are added.
Further, the apparatus includes a determining module, configured to determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transit area and the configured directory-to-sub-service-type rules in the cloud storage rules;
The storage module 302 is configured to distribute the data to be saved evenly across the cloud storage data nodes according to the configured import rule for the files of the sub-service type in the cloud storage rules, and to store the data to be saved in parallel in the distributed database of the cloud storage.
Further, the apparatus includes a management module, configured to process the data to be saved in the distributed database of the cloud storage differently at different periods, according to the configured data life cycle rules in the cloud storage rules.
The storage module is further configured to:
save different numbers of copies of data for different purposes in the distributed database according to the configured cloud storage rules, where the data for different purposes includes production data and backup data, and the production data is used for queries.
This embodiment of the present invention provides a data storage apparatus. The obtaining module obtains the data to be saved, and the storage module distributes it evenly across the cloud storage data nodes and stores it in parallel in the distributed database of the cloud storage. Because the data records in the database are stored in a distributed manner, data can be saved quickly.
An embodiment of the present invention provides a data query apparatus, which may be a cloud storage device. As shown in FIG. 4, the apparatus includes an obtaining module 401 and a processing module 402;
The obtaining module 401 is configured to obtain an index field input by a user and to generate a query instruction according to the index field;
The processing module 402 is configured to send the query instruction to each cloud storage data node, to query data in parallel in the distributed database of the cloud storage, and to send the set of query results from the cloud storage nodes to the user.
A sending unit in the processing module 402 is configured to send the query instruction to all cloud storage data nodes simultaneously;
A processing unit in the processing module 402 is configured to simultaneously query the distributed database carried on the cloud storage data nodes for data that matches the query instruction.
Optionally, the processing module 402 is configured to:
sort the query results of the cloud storage nodes according to a user-defined rule and send the sorted result set to the user; or,
sort the query results of the cloud storage nodes by node order and send the sorted result set to the user; or,
sort the query results of the cloud storage nodes by a keyword in the query results and send the sorted result set to the user.
This embodiment of the present invention provides a data query apparatus in which the processing module uses the query instruction generated by the obtaining module to query the database in parallel, which can greatly improve query performance.
Embodiment 2
This embodiment of the present invention provides a method for data storage and data query. As shown in FIG. 5, the method includes:
Step 501: The cloud storage device performs initial configuration of the cloud storage rules;
Optionally, the cloud storage device receives an administrator's initial configuration of the cloud storage rules, which includes defining the basic storage rules and defining the data life cycle.
Further, 1) define the service types that use the cloud storage;
Specifically, ① define the service name, for example a detail record service, a billing service, an electronic document service, and so on.
② Define the number of copies saved for the cloud storage service. For example, at least 2 copies of the data may be required, and the detail record service may be set to keep 3 copies.
③ Set the purpose of each copy. For example, a copy may be production data or backup data: production data serves data queries, while backup data is used to restore production data and is not queried. For example, with 3 saved copies of the detail record service, the 1st and 2nd copies may be set as production data and the 3rd as backup data; normally only the 1st and 2nd copies are accessed, and if the 1st or 2nd copy is damaged it can be restored from the 3rd. Alternatively, all 3 copies may be set as production data, in which case the cloud storage scheduler distributes requests evenly among the 3 copies.
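The copy-purpose setting can be illustrated with a small sketch. The text only says that requests are distributed evenly among production copies; round-robin is one assumed way to do that, and `COPY_PURPOSE`, `make_scheduler` and `recovery_source` are hypothetical names.

```python
from itertools import cycle

# Hypothetical copy layout for the detail record service: three copies,
# the first two serve queries and the third is kept for recovery only.
COPY_PURPOSE = {1: "production", 2: "production", 3: "backup"}


def production_copies(purposes):
    """Return the copy numbers that may serve queries."""
    return sorted(c for c, p in purposes.items() if p == "production")


def make_scheduler(purposes):
    """Round-robin over production copies so requests are spread evenly."""
    return cycle(production_copies(purposes))


def recovery_source(purposes, damaged_copy):
    """Pick the backup copy from which a damaged production copy is restored."""
    if purposes.get(damaged_copy) != "production":
        raise ValueError("only production copies are restored from backup")
    backups = sorted(c for c, p in purposes.items() if p == "backup")
    if not backups:
        raise LookupError("no backup copy configured")
    return backups[0]


scheduler = make_scheduler(COPY_PURPOSE)
served_by = [next(scheduler) for _ in range(4)]  # copies 1 and 2 alternate
```

If all three copies are marked as production, the same scheduler simply cycles over copies 1, 2 and 3.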
④ Define the data life cycle of the service type, where the life cycle is the storage policy defined for each class of data according to time.
Storage policies may include uncompressed storage, compressed storage, and deletion. Compressed storage may use different compression algorithms: low-density compression, which applies a compression ratio of about 2:1 to data that must be queried efficiently; medium-density compression, which balances query speed against storage space with a compression ratio of about 5:1; and high-density compression, which applies a compression ratio above 8:1 to data for which query efficiency matters less.
In addition, different storage policies may apply to different storage time ranges. For example, different policies may be set for the moment the data is stored in the database, for day X of storage, and for month Y of storage.
Managing the database with the data life cycle rules of the service type allows data to be compressed and purged automatically, which reduces management effort and improves database utilization.
In addition, production data and backup data may have different data life cycles. For example, production data may be stored uncompressed on entry, compressed at low density after 30 days, and deleted after 90 days; backup data may be compressed at medium density on entry, compressed at high density after 90 days, and never deleted.
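A life cycle of this kind can be modelled as an ordered list of stages keyed by data age. The following is a minimal sketch under stated assumptions: `Stage` and `policy_for` are hypothetical names, ages are counted in whole days, and a stage is assumed to take effect on the day its threshold is reached.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    from_day: int  # the stage applies once the data is this many days old
    policy: str    # "none", "low", "medium", "high" or "delete"


# Life cycles taken from the example in the text.
PRODUCTION = [Stage(0, "none"), Stage(30, "low"), Stage(90, "delete")]
BACKUP = [Stage(0, "medium"), Stage(90, "high")]


def policy_for(lifecycle, age_days):
    """Return the policy of the latest stage whose threshold has been reached."""
    current = lifecycle[0].policy
    for stage in sorted(lifecycle, key=lambda s: s.from_day):
        if age_days >= stage.from_day:
            current = stage.policy
    return current
```

For instance, `policy_for(PRODUCTION, 45)` yields `"low"`, while backup data of any age never reaches a `"delete"` stage because none is configured.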
2) Define the sub-service types. For example, the detail record service may be divided into GSM (Global System for Mobile Communications) detail records, GSM voice detail records, SMS detail records, and so on. These sub-service types correspond to tables in the cloud storage.
By default, a sub-service type inherits settings such as the number of saved copies from its service type. A sub-service type may also set its own number of copies and the data life cycle of each copy.
3) Set the relationship between cloud storage directories and sub-service types. This relationship follows the longest-path-first principle. For example, if "/CDR/" corresponds to the default service, all files under the directory "/CDR/gsm_cdr/" (including files in subdirectories) belong to the sub-service "GSM voice" and are imported into the GSM voice call record table, all files under "/CDR/gprs/" (including files in subdirectories) are imported into the GPRS call record table, and all other files under "/CDR/" are imported into the "default service" table. In other words, when the sub-service type is looked up by directory for a path five levels deep, the lookup starts at the fifth-level directory; if no rule exists at the fifth level, it moves to the fourth level, and so on.
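The longest-path-first lookup can be sketched as a walk from the deepest directory upwards. The rule table below copies the example directories from the text; the function name `sub_service_for` and the sample file names are hypothetical.

```python
from pathlib import PurePosixPath

# Directory-to-table rules from the example in the text.
RULES = {
    "/CDR": "default",
    "/CDR/gsm_cdr": "GSM voice",
    "/CDR/gprs": "GPRS",
}


def sub_service_for(file_path):
    """Walk from the deepest directory upwards and return the first rule hit."""
    directory = PurePosixPath(file_path).parent
    while True:
        rule = RULES.get(str(directory))
        if rule is not None:
            return rule
        if directory == directory.parent:  # reached the root without a match
            raise LookupError(f"no rule covers {file_path}")
        directory = directory.parent
```

A file at `/CDR/gsm_cdr/531/20111201/a.txt` is checked against four directory levels in turn and resolves to `"GSM voice"`, whereas `/CDR/other/b.txt` falls through to `"default"`.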
4) Define the import rules for the files of these sub-service types. The main settings are: the name of the information item; its parsing position; its type, for example integer, decimal, string, or large text (used to store images and files); whether the item is a data distribution field, that is, whether the whole data set may be distributed evenly according to this item; whether the item is time-type data, in which case the data life cycle can be defined on this field; and, for time-type data, the time format, for example YYYY-MM-DD HH24:MI:SS.
For example, the following values may be set for GSM voice call records:
Information name: mobile phone number (for example 13606401754); parsing position: 1; type: string (STRING), that is, processed as a string; data distribution field: yes, for example GSM detail records are distributed evenly by mobile phone number; time-type data: no; time format: not set.
The format of a GSM voice call record is:
13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20|…
The mobile phone number is the first field, so the cloud storage parses this item at position 1.
As a further example: information name: call start time; parsing position: 4; type: string (STRING), that is, processed as a string; data distribution field: yes, for example GSM detail records are distributed evenly by mobile phone number; time-type data: yes; time format: YYYY-MM-DD HH24:MI:SS, for example 2011-12-31 09:30:00.
Information name: call record type; parsing position: 2; type: string (STRING), that is, processed as a string; data distribution field: no; time-type data: no; time format: not set.
Information name: peer number (for example 053188163000); parsing position: 3; type: string (STRING), that is, processed as a string; data distribution field: no; time-type data: no; time format: not set.
Information name: call duration (for example 51 seconds); parsing position: 5; type: string (STRING), that is, processed as a string; data distribution field: no; time-type data: no; time format: not set.
Information name: call charge (for example 0.20 yuan); parsing position: 6; type: decimal; data distribution field: no; time-type data: no; time format: not set.
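The import rules above amount to a per-field description that drives parsing. A minimal sketch follows, assuming records are "|"-separated; `FieldRule`, `GSM_VOICE_RULES` and `parse_record` are hypothetical names, the English field names are illustrative, and only the phone number is treated as the distribution field here.

```python
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class FieldRule:
    name: str
    position: int                      # 1-based position in the record
    kind: str                          # "string" or "decimal"
    distribution_key: bool = False     # used to spread records over nodes
    time_format: Optional[str] = None  # strptime pattern for time-type fields


# Rules mirroring the GSM voice settings listed in the text.
# "%Y-%m-%d %H:%M:%S" is the strptime spelling of YYYY-MM-DD HH24:MI:SS.
# Assumption: only the phone number serves as distribution field in this sketch.
GSM_VOICE_RULES = [
    FieldRule("phone_number", 1, "string", distribution_key=True),
    FieldRule("record_type", 2, "string"),
    FieldRule("peer_number", 3, "string"),
    FieldRule("call_start", 4, "string", time_format="%Y-%m-%d %H:%M:%S"),
    FieldRule("duration", 5, "string"),
    FieldRule("charge", 6, "decimal"),
]


def parse_record(line, rules):
    """Split one record and convert each field as its rule prescribes."""
    parts = line.split("|")
    record = {}
    for rule in rules:
        raw = parts[rule.position - 1].strip()
        if rule.time_format is not None:
            record[rule.name] = datetime.strptime(raw, rule.time_format)
        elif rule.kind == "decimal":
            record[rule.name] = Decimal(raw)
        else:
            record[rule.name] = raw
    return record


record = parse_record(
    "13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20", GSM_VOICE_RULES
)
```

Keeping the time format on the rule rather than in the parser means a different sub-service type can reuse `parse_record` with its own rule list.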
Step 502: The cloud storage device performs initial configuration of the data extraction rules and of the data uploading rules;
The cloud storage device receives the administrator's initial configuration of the data extraction and data uploading rules. Specifically, the data extraction rules cover: ① the external data source from which data is extracted and the way of connecting to it; ② the number of data extraction processes and the data range of each process, for example extraction by region, mobile number segment, or trailing digits of the number; ③ the size of an extracted file, for example 10 MB, and a threshold on the number of phone numbers extracted, for example at most 100 phone numbers; ④ the path where files are stored after extraction. The data uploading rules cover the number of data upload processes and the data range of each upload process.
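One way to read the extraction rules is as a partitioning scheme: each number is assigned to an extraction process by its trailing digit, and each process writes files holding at most a threshold count of numbers. The sketch below assumes exactly that; `process_for` and `batches` are hypothetical helpers, not part of the described system.

```python
def process_for(phone_number, process_count):
    """Map a number to an extraction process by its last digit."""
    return int(phone_number[-1]) % process_count


def batches(numbers, max_numbers=100):
    """Split one process's numbers into files of at most max_numbers each."""
    return [numbers[i:i + max_numbers]
            for i in range(0, len(numbers), max_numbers)]
```

With 4 extraction processes, number 13606400001 goes to process 1 and 13606401754 to process 0; a process holding 250 numbers would write three files.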
It should be noted that steps 501 and 502 are preparatory work for carrying out this embodiment, and their order is not strictly fixed: either step 501 or step 502 may be performed first.
Step 503: The cloud storage device extracts data from the external data source according to the configured data extraction rules to obtain first data, or obtains second data after format conversion of the data in the external data source;
It should be noted that the external data source may be a data source stored in a terminal.
Alternatively, the cloud storage device may directly receive data that the external data source has already converted from the call record format into a format the cloud storage can recognize. In this case no extraction from the external data source is needed: the format-converted second data is obtained, and the received data is then uploaded and imported into the distributed database of the cloud storage.
Each data extraction process extracts data according to the configured data extraction rules. The extracted data is in text file format, for example a call record file with the following format:
13606400001|01|053188163000|2011-12-31 09:30:00|51|0.20|…
13606400001|01|13906400128|2011-12-31 09:35:10|65|0.40|…
13606401754|01|053188163000|2011-12-31 09:30:00|51|0.20|…
13606401754|01|13906400128|2011-12-31 09:35:10|65|0.40|…
Each line of the file represents one call detail record, with fields separated by vertical bars. The fields are defined as:
1. Mobile phone number, for example 13606400001;
2. Call type, where 01 denotes the calling party and 02 denotes the called party;
3. Peer number, for example 053188163000;
4. Call time, for example 2011-12-31 09:30:00;
5. Call duration (seconds), for example 51 seconds;
6. Call charge (yuan), for example 0.2.
For example, the first call detail record means that the owner of mobile phone 13606400001 dialed 053188163000 at 2011-12-31 09:30:00, the call lasted 51 seconds, and the call charge was 0.2 yuan.
Step 504: The cloud storage device obtains the transit area path of the cloud storage from the management node of the cloud storage;
For example, the data upload process connects to the management node of the cloud storage and calls the get-upload-directory service provided by the cloud storage. The parameters of this service are the service type, the sub-service type, and data characteristics (for example, the Jinan region with area code 531 and the billing period 20111201). The management node determines, from the service settings and how busy each cloud storage node is, the file directory the data upload process may use, and returns that directory to the upload process in URL (Uniform Resource Locator) format. The directory in URL format is the transit area path to be obtained, for example ftp://192.168.1.1/CDR/gsm_cdr/531/20111201/.
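How the management node might assemble such a path can be sketched as follows. Everything here is assumed for illustration: the load table, the second node address, the `"CDR"` service key, and the choice of the least busy node are not specified in the text, which only says that service settings and node busyness are taken into account.

```python
# Hypothetical view held by the management node: node address -> current load.
NODE_LOAD = {"192.168.1.1": 0.20, "192.168.1.2": 0.75}

# Hypothetical mapping from (service type, sub-service type) to a base directory.
BASE_DIR = {("CDR", "GSM voice"): "/CDR/gsm_cdr"}


def transit_path(service, sub_service, area_code, billing_period):
    """Pick the least busy node and return the upload directory as an FTP URL."""
    host = min(NODE_LOAD, key=NODE_LOAD.get)
    base = BASE_DIR[(service, sub_service)]
    return f"ftp://{host}{base}/{area_code}/{billing_period}/"
```

With the sample load table, `transit_path("CDR", "GSM voice", "531", "20111201")` reproduces the example path `ftp://192.168.1.1/CDR/gsm_cdr/531/20111201/`.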
Step 505: According to the configured data uploading rules and the transit area path, the cloud storage device saves the first data or the second data to the corresponding directory of the temporary file transit area of the cloud storage, where each file in each directory of the temporary file transit area is saved in text file format;
The data is saved to the directory of the temporary file transit area that matches its type. For example, GSM-format data is saved in files under "/CDR/gsm_cdr/" and GPRS-format data in files under "/CDR/gprs/".
It should be noted that when the amount of data to be stored in the distributed database of the cloud storage is very large, or when network conditions are poor, the import may be interrupted. The temporary file transit area established in this embodiment ensures that data is first uploaded completely to the second directory of the cloud storage and only then imported into the distributed database, which improves the integrity of the transferred data.
Step 506: The cloud storage device determines the sub-service type of the data to be saved according to the corresponding directory of the temporary file transit area and the configured directory-to-sub-service-type rules in the cloud storage rules;
The data import process of the cloud storage saves the temporary files of the transit area into the distributed database of the cloud storage. First, the data import process determines the sub-service type from the transit area directory that holds the data to be imported. Specifically, the table name for the files in each directory is obtained from the predefined "directory and sub-service type" rules in the cloud storage rules configured in step 501; multiple data import processes then scan multiple directories in parallel to determine the sub-service type under the second directory.
For example, files awaiting import in the "/CDR/gsm_cdr/" directory (files whose transfer has been confirmed complete and which can therefore be imported) are imported into the GSM voice call record table according to the predefined rules. When the sub-service type is determined from the "/CDR/gsm_cdr/" directory, it is found to be GSM voice detail records.
Step 507: According to the configured import rule for the files of the sub-service type in the cloud storage rules, the cloud storage device distributes the data to be saved evenly across the cloud storage data nodes and stores it in parallel in the distributed database of the cloud storage;
The cloud storage device distributes the data to be saved evenly across the cloud storage data nodes according to a hash algorithm. It then stores the different pieces of data to be saved on the data nodes simultaneously in the distributed database of the cloud storage, or splits one piece of data to be saved and stores the resulting fragments on the data nodes simultaneously in the distributed database of the cloud storage.
The files in the temporary file transit area are imported into the distributed database of the cloud storage according to the import rules defined in step 501 for the files of the sub-service type. During import, the data distribution rule defined in the import rules applies: when an information item is a data distribution field, the data to be saved in the temporary file transit area is automatically distributed evenly across the cloud storage nodes by a hash algorithm. For example, if distribution is by mobile phone number, the records containing mobile phone number 1 are placed on node A and the records containing mobile phone number 2 on node B.
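Hash-based distribution by mobile phone number can be sketched as below. The text names only "a hash algorithm"; MD5 followed by a modulo over the node list is an assumed choice, and `node_for` and `distribute` are hypothetical names.

```python
import hashlib


def node_for(phone_number, nodes):
    """Map a distribution-field value to one node via a stable hash."""
    digest = hashlib.md5(phone_number.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def distribute(lines, nodes):
    """Group "|"-separated records by the node their phone number hashes to."""
    placement = {node: [] for node in nodes}
    for line in lines:
        phone_number = line.split("|", 1)[0]  # distribution field at position 1
        placement[node_for(phone_number, nodes)].append(line)
    return placement
```

`hashlib` is used rather than Python's built-in `hash()` because the latter is randomized per process for strings, so the same number could land on different nodes between runs. All records of one phone number end up on the same node, which keeps a per-subscriber query local to that node.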
Further, in this step the cloud storage performs the import in parallel automatically, and the degree of parallelism increases automatically as cloud storage nodes are added: with 3 cloud storage nodes the degree of parallelism is 3, and with 4 nodes it is 4. It should be noted that conventional import methods do not load the database in parallel by default; parallel import can be specified manually, but the degree of parallelism does not then grow with added hardware capacity. The data import and reading described here therefore perform considerably better than the conventional data import approach.
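The link between node count and degree of parallelism can be illustrated with one worker per data node. This is a sketch under assumptions: `store_on_node` writes to an in-memory dictionary standing in for a node's share of the distributed database, and `parallel_import` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def store_on_node(node, records, database):
    """Stand-in for writing one node's share into the distributed database."""
    database[node] = list(records)
    return len(records)


def parallel_import(placement):
    """Store every node's records at the same time, one worker per node."""
    database = {}
    parallelism = max(1, len(placement))  # grows with the number of nodes
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        counts = list(pool.map(
            lambda item: store_on_node(item[0], item[1], database),
            placement.items(),
        ))
    return database, parallelism, sum(counts)
```

Because the worker count is derived from the placement itself, adding a fourth node to `placement` raises the degree of parallelism from 3 to 4 without any configuration change.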
另外, 将待保存数据导入云存储的分布式数据库时, 可以根据配置的云 存储规则对数据进行多份保存, 例如针对详单业务可以保存 3份数据, 第 1 份与第 2份为生产数据, 供查询时使用, 并且不进行压缩, 第 3份为备份数 据, 进行中密度压缩。  In addition, when the data to be saved is imported into the distributed database of the cloud storage, the data can be saved in multiple copies according to the configured cloud storage rules. For example, three copies of the data can be saved for the detailed service, and the first and second copies are production data. , used for query, and does not compress, the third is backup data, for medium density compression.
Step 508: The cloud storage device applies different processing to the data to be saved in the distributed database of the cloud storage at different stages of its life, according to the data life cycle rules in the configured cloud storage rules.
For example, production data may be compressed at low density after 30 days and deleted after 90 days; backup data is compressed at medium density when it is stored in the distributed database of the cloud storage, compressed at high density after 90 days, and never deleted.
Automatically compressing and purging the database according to the data life cycle rules improves the storage efficiency of the database, reduces the workload of maintenance personnel, and lowers management difficulty.
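The example life cycle rules in step 508 can be written as a small lookup table. The sketch below is illustrative only: the names `LIFECYCLE_RULES` and `lifecycle_action` and the action labels are hypothetical, and treating "after 30 days" as "age of 30 days or more" is an assumption, since the text does not define the boundary.

```python
# (minimum age in days, action), checked from the oldest threshold down.
LIFECYCLE_RULES = {
    "production": [(90, "delete"), (30, "compress_low"), (0, "keep_uncompressed")],
    "backup": [(90, "compress_high"), (0, "compress_medium")],
}


def lifecycle_action(kind, age_days):
    """Return the action that applies to data of this kind at this age."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    for min_age, action in LIFECYCLE_RULES[kind]:
        if age_days >= min_age:
            return action
```

Because backup data has no "delete" entry, it is never removed, matching the example; a periodic job could call `lifecycle_action` for each stored partition and apply the returned action.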
Step 509: The cloud storage device obtains an index field input by a user and generates a query instruction according to the index field.
For example, the index fields may be a mobile phone number and a query month. The generated query instruction then includes the mobile phone number and the query month.
Further, the index field input by the user is received through a query interface.
Step 510: The cloud storage device sends the query instruction to the data nodes of the cloud storage and queries data in the distributed database of the cloud storage in parallel.
The cloud storage device sends the query instruction to all data nodes of the cloud storage simultaneously, and simultaneously queries the distributed databases carried on those data nodes for data that matches the query instruction.
It should be noted that only production data is queried. The query application itself does not need to know where the data is stored or whether it is compressed. The steps of a data query that can be parallelized are decomposed and executed in parallel on the individual storage nodes, which greatly improves query performance.
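Steps 509 and 510 amount to a scatter-gather query. The sketch below illustrates the idea under simplifying assumptions: each data node is modelled as an in-memory list of records, and the field names (`phone_number`, `month`, `kind`) and function names are hypothetical. It is not the query engine of the embodiment.

```python
from concurrent.futures import ThreadPoolExecutor


def build_query(phone_number, month):
    """Build a query instruction from the user's index fields."""
    return {"phone_number": phone_number, "month": month}


def query_node(records, query):
    """Return the production records on one node that match every index field."""
    return [
        r for r in records
        if r["kind"] == "production"
        and all(r.get(field) == value for field, value in query.items())
    ]


def parallel_query(nodes, query):
    """Send the same query to every node at once and collect the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        per_node = list(pool.map(lambda records: query_node(records, query), nodes))
    # Flatten in node order; backup copies were already filtered out per node.
    return [row for rows in per_node for row in rows]
```

The caller passes only the index fields and never refers to a node or a compression state, which mirrors the statement that the query application does not need to know where the data is stored.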
Step 511: The cloud storage device sends the set of query results from the cloud storage nodes to the user.
Optionally, the cloud storage device sorts the query results of the cloud storage nodes according to a user-defined rule and sends the sorted query result set to the user; or
the cloud storage device sorts the query results of the cloud storage nodes in node order and sends the sorted query result set to the user; or
the cloud storage device sorts the query results of the cloud storage nodes by a keyword in the query results and sends the sorted query result set to the user.
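The three ordering options of step 511 can be illustrated with one merge function. This is a sketch under stated assumptions: results are held in a dictionary keyed by a hypothetical node identifier, "node order" is taken to mean sorted node identifiers, and the names `merge_results`, `key_field`, and `custom_key` are introduced only for the example.

```python
def merge_results(per_node, mode="node", key_field=None, custom_key=None):
    """Merge per-node result lists into one ordered result set.

    per_node maps a node identifier to the list of rows that node returned.
    """
    if mode == "node":
        # Keep each node's rows together, nodes in identifier order.
        return [row for node_id in sorted(per_node) for row in per_node[node_id]]
    rows = [row for node_rows in per_node.values() for row in node_rows]
    if mode == "keyword":
        # Order by the value of one field present in every row.
        return sorted(rows, key=lambda row: row[key_field])
    if mode == "custom":
        # Order by a user-supplied key function.
        return sorted(rows, key=custom_key)
    raise ValueError(f"unknown mode: {mode}")
```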
FIG. 5B is a schematic diagram of a data storage and data query method. The cloud storage device obtains the data to be saved from an external data source through data extraction processes. The data upload processes take the temporary files and upload them to the temporary file transfer area, where they wait to be imported into the distributed database of the cloud storage. The data import processes take the files from the temporary file transfer area and import them in parallel into the distributed database of the cloud storage. When the data in the distributed database is managed later, it is compressed and purged according to the data life cycle rules. When a user needs to query data in the distributed database of the cloud storage, the query can be made directly through the query interface.
The data storage and data query method provided by this embodiment of the present invention provides parallel storage and parallel data query, which increases the speed of storing and retrieving data and lowers management difficulty.
An embodiment of the present invention provides a data storage apparatus, which may be a cloud storage device. As shown in FIG. 6, the apparatus includes: an obtaining module 601, a data obtaining unit 6011, a data uploading unit 6012, a storage module 602, a distribution unit 6021, a storage unit 6022, an initial configuration module 603, a determining module 604, and a management module 605.
The obtaining module 601 is configured to obtain data to be saved.
The storage module 602 is connected to the obtaining module 601 and is configured to distribute the data to be saved evenly across the data nodes of the cloud storage and to store the data to be saved in the distributed database of the cloud storage in parallel.
Further, the apparatus also includes an initial configuration module 603, configured to: perform initial configuration of the cloud storage rules, including defining the rules that relate directories to sub-service types, defining the import rules for the files of each sub-service type, and defining the life cycle of the data, where the life cycle is the storage policy defined for each type of data over time; perform initial configuration of the data extraction rules, including the data sources to extract from, the number of extraction processes, and the data range assigned to each extraction process; and perform initial configuration of the data upload rules, including the number of upload processes and the data range assigned to each upload process.
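The three groups of rules handled by the initial configuration module can be pictured as plain data structures. The sketch below is an assumption-laden illustration: the class names, field names, and the check that each process has exactly one data range are hypothetical and only reflect the items listed in the text.

```python
from dataclasses import dataclass


@dataclass
class ProcessRule:
    """Extraction or upload rule: a process count and one data range per process."""
    process_count: int
    ranges: list

    def __post_init__(self):
        if len(self.ranges) != self.process_count:
            raise ValueError("each process needs exactly one data range")


@dataclass
class ExtractionRule(ProcessRule):
    """Extraction additionally names the data source to read from."""
    data_source: str = ""


@dataclass
class CloudStorageRules:
    directory_to_subservice: dict  # directory name -> sub-service type
    import_rules: dict             # sub-service type -> import rule for its files
    lifecycle: dict                # data type -> storage policy over time
```

For instance, `ExtractionRule(2, ["1360640", "1360641"], data_source="billing_db")` would describe two extraction processes, each owning one number segment; the source name and the second segment are made-up values.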
Further, the obtaining module 601 includes a data obtaining unit 6011 and a data uploading unit 6012.
Optionally, the data obtaining unit 6011 is configured to extract data from an external data source according to the configured data extraction rules to obtain first data, or to convert the format of the data in the external data source to obtain second data.
The data uploading unit 6012 is connected to the data obtaining unit 6011 and is configured to obtain the transfer area path of the cloud storage from the management node of the cloud storage, and to save the first data or the second data to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rules and the transfer area path, where each file in each directory of the temporary file transfer area is saved in text file format.
Further, the apparatus also includes a determining module 604. The determining module 604 is connected to the storage module 602 and is configured to determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the rules relating directories to sub-service types in the configured cloud storage rules.
The storage module 602 is configured to distribute the data to be saved evenly across the data nodes of the cloud storage according to the import rules for the files of the sub-service type in the configured cloud storage rules, and to store the data to be saved in the distributed database of the cloud storage in parallel.
Further, the distribution unit 6021 in the storage module 602 is configured to distribute the data to be saved evenly across the cloud storage data nodes according to a hash algorithm.
The storage unit 6022 in the storage module 602 is configured to store the different items of data to be saved on the cloud storage data nodes into the distributed database of the cloud storage simultaneously, or to store the segments obtained by splitting a same item of data to be saved on the cloud storage data nodes into the distributed database of the cloud storage simultaneously.
When the data to be saved is distributed evenly across the cloud storage data nodes and stored in the distributed database of the cloud storage in parallel, the degree of parallelism of the parallel storage increases automatically as cloud storage data nodes are added.
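The text does not say which hash algorithm the distribution unit uses. As one possible illustration, the sketch below hashes a record key with SHA-256 and takes the result modulo the node count; the choice of SHA-256, the use of a phone number as the key, and the names `node_for` and `distribute` are all assumptions made for the example.

```python
import hashlib


def node_for(key, node_count):
    """Map a record key to a node index in the range 0..node_count-1."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def distribute(keys, node_count):
    """Group keys into one bucket per node; buckets can then be stored in parallel."""
    buckets = [[] for _ in range(node_count)]
    for key in keys:
        buckets[node_for(key, node_count)].append(key)
    return buckets
```

A plain modulo mapping reassigns most keys when `node_count` changes; a scheme such as consistent hashing would reduce that movement, but whether the embodiment needs it is not stated.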
The storage module 602 is further configured to keep different numbers of copies of data that serves different purposes in the distributed database, according to the configured cloud storage rules, where the data serving different purposes includes production data and backup data, and the production data is used for queries.
The apparatus also includes a management module 605, configured to apply different processing to the data to be saved in the distributed database of the cloud storage at different stages of its life, according to the data life cycle rules in the configured cloud storage rules. For example, three copies may be kept for the detailed-bill service: the first and second copies are production data, which serve queries and are not compressed; the third copy is backup data, which is compressed at medium density.
An embodiment of the present invention provides a data storage apparatus in which the obtaining module obtains the data to be saved, and the storage module distributes the data to be saved evenly across the data nodes of the cloud storage and stores it in the distributed database of the cloud storage in parallel, which increases the speed of data storage.
An embodiment of the present invention provides a data query apparatus, which may be a cloud storage device. As shown in FIG. 7, the apparatus includes: an obtaining module 701, a processing module 702, a sending unit 7021, and a processing unit 7022.
The obtaining module 701 is configured to obtain an index field input by a user and to generate a query instruction according to the index field.
The processing module 702 is configured to send the query instruction to the data nodes of the cloud storage, to query data in the distributed database of the cloud storage in parallel, and to send the set of query results from the cloud storage nodes to the user.
The saved data can be divided into production data and backup data. Only production data is queried; when production data is damaged, the backup data can be used to restore it. A query does not need to take into account where the data is stored or whether it is compressed.
The sending unit 7021 in the processing module 702 is configured to send the query instruction to all data nodes of the cloud storage simultaneously. The processing unit 7022 in the processing module 702 is configured to simultaneously query the distributed databases carried on the data nodes of the cloud storage for data that matches the query instruction.
Optionally, the processing module 702 is configured to:
sort the query results of the cloud storage nodes according to a user-defined rule and send the sorted query result set to the user; or
sort the query results of the cloud storage nodes in node order and send the sorted query result set to the user; or
sort the query results of the cloud storage nodes by a keyword in the query results and send the sorted query result set to the user.
An embodiment of the present invention provides a data query apparatus in which, using the query instruction generated by the obtaining module, the processing module queries data in the database simultaneously and in parallel, which can greatly improve query performance.
It should be noted that the apparatuses shown in FIG. 6 and FIG. 7 may be the same apparatus, namely a cloud storage device; that is, the cloud storage device can perform both the data storage and the data query functions.
An embodiment of the present invention provides a data storage system which, as shown in FIG. 8, includes a terminal 801 and a cloud storage device 802.
The terminal 801 is configured to extract data from a data source according to configured data extraction rules to obtain first data, and to save the first data in a temporary folder, so that the cloud storage device can upload the first data in the temporary folder to the corresponding directory of the temporary file transfer area of the cloud storage device according to the data upload rules and the obtained transfer area path.
The cloud storage device 802 is configured to upload the first data in the temporary folder on the terminal to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rules and the transfer area path; to distribute the first data in the corresponding directory of the temporary file transfer area evenly across the cloud storage data nodes; and to store the data to be saved in the distributed database of the cloud storage in parallel.
The data source is the data saved on the terminal 801. Each data extraction process extracts data according to the configured data extraction rules and saves the extracted data to the directory corresponding to that process. Each directory contains multiple temporary files; for example, the extracted first data is saved as a first file, and the first file is saved to the first directory corresponding to the data extraction process. For example, the process responsible for extracting the December 2011 GSM detailed bills of billing cycle 1 for number segment 1360640 in Jinan corresponds to the directory /531/gsm_cdr/201112/01/1360640.
When the size of a temporary file, or the number of phone numbers saved in it, reaches the threshold in the configured data extraction rules, a new file is created. For example, GSM_531_20111201_1360640.0020 denotes the call detail record file with sequence number 0020 for number segment 1360640 in Jinan, billing cycle 01, December 2011; when the number of phone numbers saved in this file reaches the threshold in the configured data extraction rules, the file GSM_531_20111201_1360640.0021 is created.
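The directory and file naming in the examples above can be reproduced with a few helper functions. The layout below is inferred from the two example strings only (area code 531, service GSM, month 201112, billing cycle 01, number segment 1360640, four-digit sequence number); the function names and the reading of `20111201` as month followed by billing cycle are assumptions, not a specification.

```python
def extraction_dir(area, service, month, cycle, segment):
    """Directory of one extraction process, e.g. /531/gsm_cdr/201112/01/1360640."""
    return f"/{area}/{service.lower()}_cdr/{month}/{cycle}/{segment}"


def file_name(area, service, month, cycle, segment, seq):
    """Temporary file name with a zero-padded four-digit sequence number."""
    return f"{service.upper()}_{area}_{month}{cycle}_{segment}.{seq:04d}"


def should_roll(saved_numbers, threshold):
    """True once the current file has reached the configured threshold."""
    return saved_numbers >= threshold


def next_file_name(current):
    """Name of the file that follows `current`, keeping the sequence width."""
    stem, seq = current.rsplit(".", 1)
    return f"{stem}.{int(seq) + 1:0{len(seq)}d}"
```

An extraction process would call `should_roll` after each write and switch to `next_file_name(current)` when it returns true.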
The cloud storage device 802 may be the data storage apparatus shown in FIG. 6.
When the amount of data to be stored in the distributed database of the cloud storage is very large, or when network conditions are poor, the import may be interrupted. In this embodiment of the present invention, the extracted first data is therefore first saved in a temporary folder on the terminal and then uploaded to the established temporary file transfer area. This ensures that the data has been completely uploaded to the second directory of the cloud storage before it is imported into the distributed database of the cloud storage, which improves the integrity of the transmitted data.
The foregoing is merely a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A data storage method, comprising:
obtaining, by a cloud storage device, data to be saved; and
distributing, by the cloud storage device, the data to be saved evenly across cloud storage data nodes, and storing the data to be saved in a distributed database of the cloud storage in parallel.
2. The method according to claim 1, wherein the distributing, by the cloud storage device, the data to be saved evenly across the cloud storage data nodes and storing the data to be saved in the distributed database of the cloud storage in parallel comprises:
distributing, by the cloud storage device, the data to be saved evenly across the cloud storage data nodes according to a hash algorithm; and
storing, by the cloud storage device, the different items of data to be saved on the cloud storage data nodes into the distributed database of the cloud storage simultaneously, or storing the segments obtained by splitting a same item of data to be saved on the cloud storage data nodes into the distributed database of the cloud storage simultaneously.
3. The method according to claim 2, wherein when the cloud storage device distributes the data to be saved evenly across the cloud storage data nodes and stores the data to be saved in the distributed database of the cloud storage in parallel, the degree of parallelism of the parallel storage increases automatically as cloud storage data nodes are added.
4. The method according to claim 1, wherein before the cloud storage device obtains the data to be saved, the method further comprises:
performing initial configuration of cloud storage rules, including defining rules that relate directories to sub-service types, defining import rules for the files of each sub-service type, and defining a life cycle of the data, wherein the life cycle is the storage policy defined for each type of data over time;
performing initial configuration of data extraction rules, including the data sources to extract from, the number of extraction processes, and the data range assigned to each extraction process; and
performing initial configuration of data upload rules, including the number of upload processes and the data range assigned to each upload process.
5. The method according to claim 4, wherein the obtaining data to be saved comprises:
extracting data from an external data source according to the configured data extraction rules to obtain first data, or converting the format of the data in the external data source to obtain second data;
obtaining a transfer area path of the cloud storage from a management node of the cloud storage; and
saving the first data or the second data to a corresponding directory of a temporary file transfer area of the cloud storage according to the configured data upload rules and the transfer area path, wherein each file in each directory of the temporary file transfer area is saved in text file format.
6. The method according to claim 5, wherein before the cloud storage device distributes the data to be saved evenly across the cloud storage data nodes and stores the data to be saved in the distributed database of the cloud storage in parallel, the method further comprises:
determining the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the rules relating directories to sub-service types in the configured cloud storage rules; and
distributing the data to be saved evenly across the data nodes of the cloud storage according to the import rules for the files of the sub-service type in the configured cloud storage rules, and storing the data to be saved in the distributed database of the cloud storage in parallel.
7. The method according to claim 6, wherein after the distributing the data to be saved evenly across the cloud storage data nodes and storing the data to be saved in a database of the cloud storage in parallel, the method further comprises:
applying different processing to the data to be saved in the distributed database of the cloud storage at different stages of its life, according to the data life cycle rules in the configured cloud storage rules.
8. The method according to claim 4, wherein after the storing the data to be saved in the distributed database of the cloud storage in parallel, the method further comprises:
keeping different numbers of copies of data that serves different purposes in the distributed database according to the configured cloud storage rules, wherein the data serving different purposes includes production data and backup data, and the production data is used for queries.
9. A data query method, comprising:
obtaining, by a cloud storage device, an index field input by a user, and generating a query instruction according to the index field;
sending, by the cloud storage device, the query instruction to data nodes of the cloud storage, and querying data in a distributed database of the cloud storage in parallel; and
sending, by the cloud storage device, the set of query results from the cloud storage nodes to the user.
10. The method according to claim 9, wherein the sending, by the cloud storage device, the query instruction to the data nodes of the cloud storage and querying data in a database of the cloud storage in parallel comprises:
sending, by the cloud storage device, the query instruction to all data nodes of the cloud storage simultaneously; and
querying, by the cloud storage device, simultaneously, the distributed databases carried on the data nodes of the cloud storage for data that matches the query instruction.
11. The method according to claim 6, wherein the sending, by the cloud storage device, the set of query results from the cloud storage nodes to the user comprises:
sorting, by the cloud storage device, the query results of the cloud storage nodes according to a user-defined rule, and sending the sorted query result set to the user; or
sorting, by the cloud storage device, the query results of the cloud storage nodes in node order, and sending the sorted query result set to the user; or
sorting, by the cloud storage device, the query results of the cloud storage nodes by a keyword in the query results, and sending the sorted query result set to the user.
12. A data storage apparatus, comprising:
an obtaining module, configured to obtain data to be saved; and
a storage module, configured to distribute the data to be saved evenly across data nodes of cloud storage, and to store the data to be saved in a distributed database of the cloud storage in parallel.
13. The apparatus according to claim 12, wherein the storage module comprises:
a distribution unit, configured to distribute the data to be saved evenly across the cloud storage data nodes according to a hash algorithm; and
a storage unit, configured to store the different items of data to be saved on the cloud storage data nodes into the distributed database of the cloud storage simultaneously, or to store the segments obtained by splitting a same item of data to be saved on the cloud storage data nodes into the distributed database of the cloud storage simultaneously.
14. The apparatus according to claim 13, wherein when the data to be saved is distributed evenly across the cloud storage data nodes and stored in the distributed database of the cloud storage in parallel, the degree of parallelism of the parallel storage increases automatically as cloud storage data nodes are added.
15. The apparatus according to claim 12, further comprising:
an initial configuration module, configured to perform initial configuration of cloud storage rules, including defining rules that relate directories to sub-service types, defining import rules for the files of each sub-service type, and defining a life cycle of the data, wherein the life cycle is the storage policy defined for each type of data over time;
to perform initial configuration of data extraction rules, including the data sources to extract from, the number of extraction processes, and the data range assigned to each extraction process; and
to perform initial configuration of data upload rules, including the number of upload processes and the data range assigned to each upload process.
16. The apparatus according to claim 15, wherein the obtaining module comprises:
a data obtaining unit, configured to extract data from an external data source according to the configured data extraction rules to obtain first data, or to convert the format of the data in the external data source to obtain second data; and
a data uploading unit, configured to obtain a transfer area path of the cloud storage from a management node of the cloud storage, and to save the first data or the second data to a corresponding directory of a temporary file transfer area of the cloud storage according to the configured data upload rules and the transfer area path, wherein each file in each directory of the temporary file transfer area is saved in text file format.
17. The apparatus according to claim 16, further comprising:
a determining module, configured to determine the sub-service type of the data to be saved according to the corresponding directory of the temporary file transfer area and the rules relating directories to sub-service types in the configured cloud storage rules;
wherein the storage module is configured to distribute the data to be saved evenly across the data nodes of the cloud storage according to the import rules for the files of the sub-service type in the configured cloud storage rules, and to store the data to be saved in the distributed database of the cloud storage in parallel.
18. The apparatus according to claim 17, further comprising:
a management module, configured to apply different processing to the data to be saved in the distributed database of the cloud storage at different stages of its life, according to the data life cycle rules in the configured cloud storage rules.
19. The apparatus according to claim 15, wherein the storage module is further configured to:
keep different numbers of copies of data that serves different purposes in the distributed database according to the configured cloud storage rules, wherein the data serving different purposes includes production data and backup data, and the production data is used for queries.
20、 一种数据查询的装置, 其特征在于, 包括: 20. A data query device, characterized by including:
获取模块, 用于获取用户输入的索引字段, 根据所述索引字段生成查询指 令; The acquisition module is used to obtain the index fields input by the user and generate query instructions based on the index fields;
处理模块, 用于将所述查询指令发送到各云存储的数据节点, 并行地在在 云存储的分布式数据库中查询数据; 以及将所述各云存储节点的查询结果的集 合发送给所述用户。 A processing module, configured to send the query instructions to the data nodes of each cloud storage, query data in the distributed database of the cloud storage in parallel; and send a set of query results of each cloud storage node to the user.
21. The apparatus according to claim 20, wherein the processing module comprises: a sending unit, configured to send the query instruction to each cloud storage data node simultaneously; and a processing unit, configured to query, simultaneously in the distributed database carried on each cloud storage data node, the data that matches the query instruction.
22. The apparatus according to claim 20, wherein the processing module is configured to: sort the query results of the cloud storage nodes according to a user-defined rule, and send the sorted set of query results to the user; or,
sort the query results of the cloud storage nodes in node order, and send the sorted set of query results to the user; or,
sort the query results of the cloud storage nodes according to a keyword in the query results, and send the sorted set of query results to the user.
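The query flow of claims 20 to 22 can be sketched in Python as follows. This is a sketch under stated assumptions, not the claimed apparatus itself: each data node is modeled as a plain list of dictionaries, the query instruction is modeled as an equality predicate built from the index fields, and the function names `build_query`, `query_node` and `query_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def build_query(index_fields: dict):
    """Turn user-supplied index fields into a query instruction (a predicate)."""

    def matches(record: dict) -> bool:
        return all(record.get(key) == value for key, value in index_fields.items())

    return matches


def query_node(node_records, query):
    """Run the query instruction against the records held by one data node."""
    return [record for record in node_records if query(record)]


def query_all(nodes, index_fields, sort_key=None):
    """Send the query to every node in parallel and merge the results.

    With sort_key=None the merged results stay in node order; otherwise they
    are sorted by the given key function (a user-defined rule or a keyword).
    """
    query = build_query(index_fields)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # pool.map yields results in the order of its inputs, i.e. node order.
        per_node = list(pool.map(lambda records: query_node(records, query), nodes))
    merged = [record for results in per_node for record in results]
    if sort_key is not None:
        merged.sort(key=sort_key)
    return merged
```

For example, `query_all(nodes, {"user": "a"})` returns the matching records grouped by node, while passing `sort_key=lambda r: r["t"]` orders the merged set by the hypothetical field `t`.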
23. A data storage system, comprising: a terminal and a cloud storage device; wherein the terminal is configured to extract data from a data source according to a configured data extraction rule to obtain first data; and to save the first data in a temporary folder, so that the cloud storage device uploads the first data in the temporary folder to the corresponding directory of the temporary file transfer area of the cloud storage device according to a data upload rule and an obtained transfer area path;
the cloud storage device is configured to upload the first data in the temporary folder of the terminal to the corresponding directory of the temporary file transfer area of the cloud storage according to the configured data upload rule and the transfer area path; and to distribute the first data in the corresponding directory of the temporary file transfer area evenly across the cloud storage data nodes, and store the data to be saved in the distributed database of the cloud storage in parallel.
24. The system according to claim 23, wherein the cloud storage device is the data storage apparatus according to claims 12 to 19.
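The first two stages of the system in claim 23, extraction into a temporary folder on the terminal and upload into the transfer area by the cloud storage device, can be sketched in Python as below. The sketch assumes that both sides share a local file system, that the extraction rule is a predicate over text rows, and that the upload rule maps a file name to a transfer-area subdirectory; the function names and the `voice_001.csv` naming scheme are hypothetical and do not appear in the application.

```python
import shutil
import tempfile
from pathlib import Path


def extract_to_temp(source_rows, extraction_rule, temp_dir: Path, file_name: str) -> Path:
    """Terminal side: keep the rows that pass the rule and write them to the temp folder."""
    first_data = [row for row in source_rows if extraction_rule(row)]
    target = temp_dir / file_name
    target.write_text("\n".join(first_data), encoding="utf-8")
    return target


def upload_to_transfer_area(temp_dir: Path, transfer_root: Path, upload_rule) -> list:
    """Cloud side: copy each temp file into the directory chosen by the upload rule."""
    uploaded = []
    for file in sorted(temp_dir.iterdir()):
        destination_dir = transfer_root / upload_rule(file.name)
        destination_dir.mkdir(parents=True, exist_ok=True)
        uploaded.append(Path(shutil.copy2(file, destination_dir)))
    return uploaded


def demo():
    """Run both stages in a throwaway directory and return (relative path, content)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        temp_dir = root / "temp"
        temp_dir.mkdir()
        transfer_root = root / "transfer"
        extract_to_temp(
            ["voice,1", "sms,2", "voice,3"],
            lambda row: row.startswith("voice"),  # hypothetical extraction rule
            temp_dir,
            "voice_001.csv",
        )
        uploaded = upload_to_transfer_area(
            temp_dir,
            transfer_root,
            lambda name: name.split("_")[0],  # hypothetical upload rule
        )
        first = uploaded[0]
        return first.relative_to(transfer_root).as_posix(), first.read_text(encoding="utf-8")
```

In a real deployment the upload would cross a network rather than a shared file system; the local copy here only shows how the upload rule and the transfer area path together select the destination directory.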
PCT/CN2012/079155 2012-07-25 2012-07-25 Method and apparatus for data storage and query WO2014015488A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201280000916.6A CN102906751B (en) 2012-07-25 2012-07-25 A kind of method of data storage, data query and device
PCT/CN2012/079155 WO2014015488A1 (en) 2012-07-25 2012-07-25 Method and apparatus for data storage and query

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/079155 WO2014015488A1 (en) 2012-07-25 2012-07-25 Method and apparatus for data storage and query

Publications (1)

Publication Number Publication Date
WO2014015488A1 true WO2014015488A1 (en) 2014-01-30

Family

ID=47577492

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/079155 WO2014015488A1 (en) 2012-07-25 2012-07-25 Method and apparatus for data storage and query

Country Status (2)

Country Link
CN (1) CN102906751B (en)
WO (1) WO2014015488A1 (en)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103116661B (en) * 2013-03-20 2016-01-27 广东宜通世纪科技股份有限公司 A kind of data processing method of database
CN103207919A (en) * 2013-04-26 2013-07-17 北京亿赞普网络技术有限公司 Method and device for quickly inquiring and calculating MangoDB cluster
CN104123300B (en) * 2013-04-26 2017-10-13 上海云人信息科技有限公司 Data distribution formula storage system and method
CN104426942A (en) * 2013-08-27 2015-03-18 鸿富锦精密工业(深圳)有限公司 File uploading method and system
CN104424109B (en) * 2013-09-09 2020-03-24 联想(北京)有限公司 Information processing method and electronic equipment
CN103458055A (en) * 2013-09-22 2013-12-18 广州中国科学院软件应用技术研究所 Clout competing platform
CN103841177A (en) * 2013-11-08 2014-06-04 汉柏科技有限公司 Cloud computing data storage system and method based on gridding
CN103685488B (en) * 2013-12-03 2017-08-11 华为软件技术有限公司 The control method and Dropbox of resource in Dropbox
WO2015123809A1 (en) * 2014-02-18 2015-08-27 华为技术有限公司 Data table importing method, data manager and server
CN104049917A (en) * 2014-06-25 2014-09-17 北京思特奇信息技术股份有限公司 Data processing method and system
TWI604320B (en) * 2014-08-01 2017-11-01 緯創資通股份有限公司 Methods for accessing big data and systems using the same
CN104572862A (en) * 2014-12-19 2015-04-29 阳珍秀 Mass data storage access method and system
CN104618482B (en) * 2015-02-02 2019-07-16 浙江宇视科技有限公司 Access method, server, conventional memory device, the system of cloud data
CN106156209A (en) * 2015-04-23 2016-11-23 中兴通讯股份有限公司 Data processing method and device
CN105022833A (en) * 2015-08-10 2015-11-04 浪潮(北京)电子信息产业有限公司 Data processing method, nodes and monitoring system
CN106557469B (en) * 2015-09-24 2020-11-20 创新先进技术有限公司 Method and device for processing data in data warehouse
CN105912609B (en) * 2016-04-06 2019-04-02 中国农业银行股份有限公司 A kind of data file processing method and device
CN105938489A (en) * 2016-04-14 2016-09-14 北京思特奇信息技术股份有限公司 Storage and display method and system of compressed detailed lists
CN106372115A (en) * 2016-08-23 2017-02-01 成都乾威科技有限公司 Data reading/writing method and system, and database system
CN107967279A (en) * 2016-10-19 2018-04-27 北京国双科技有限公司 The data-updating method and device of distributed data base
CN106649530B (en) * 2016-10-21 2020-12-15 北京卡拉卡尔科技股份有限公司 Cloud detail query management system and method
CN106569896B (en) * 2016-10-25 2019-02-05 北京国电通网络技术有限公司 A kind of data distribution and method for parallel processing and system
CN107092700A (en) * 2017-05-02 2017-08-25 山东浪潮通软信息科技有限公司 It is a kind of based on the method and device for importing data under big data quantity in batches
CN108241742B (en) * 2018-01-02 2022-03-29 联想(北京)有限公司 Database query system and method
CN109521954B (en) * 2018-10-12 2021-11-16 许继集团有限公司 Distribution network FTU fixed point file management method and device
CN109447876A (en) * 2018-10-16 2019-03-08 湖北三峡云计算中心有限责任公司 A kind of burgher card system
CN111797422A (en) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Data privacy protection query method and device, storage medium and electronic equipment
US11100109B2 (en) 2019-05-03 2021-08-24 Microsoft Technology Licensing, Llc Querying data in a distributed storage system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222090A (en) * 2011-06-02 2011-10-19 清华大学 Mass data resource management frame under cloud environment
CN102360390A (en) * 2011-10-24 2012-02-22 浙江大学 Knowledge cloud database retrieval method and system based on medical keywords

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101557551A (en) * 2009-05-11 2009-10-14 成都市华为赛门铁克科技有限公司 Cloud service accessing method, device and communication system thereof for mobile terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, YONG: "Design and implementation of communication data distributed query algorithm based on Hadoop platform", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, CHINA MASTER'S THESES FULL-TEXT DATABASE, June 2010 (2010-06-01), pages 14 AND 31 - 40 *
WANG, PENG ET AL.: "Study of Realized Method on a Cloud Computer Architecture", COMPUTER ENGINEERING & SCIENCE, vol. 31, no. AL, October 2009 (2009-10-01), pages 11 - 13 *

Also Published As

Publication number Publication date
CN102906751B (en) 2015-12-02
CN102906751A (en) 2013-01-30

Similar Documents

Publication Publication Date Title
WO2014015488A1 (en) Method and apparatus for data storage and query
CN102790760B (en) Data synchronization method based on directory tree in safe network disc system
CN1318974C (en) Method for compression and search of database backup data
WO2016180055A1 (en) Method, device and system for storing and reading data
CN106161633B (en) Transmission method and system for packed files based on cloud computing environment
CN102456059A (en) Data deduplication processing system
WO2018036324A1 (en) Smart city information sharing method and device
CN105302920A (en) Optimal management method and system for cloud storage data
CN101937474A (en) Mass data query method and device
US11734229B2 (en) Reducing database fragmentation
CN104239377A (en) Platform-crossing data retrieval method and device
US20200065306A1 (en) Bloom filter partitioning
CN104050276A (en) Cache processing method and system of distributed database
CN108415671B (en) Method and system for deleting repeated data facing green cloud computing
US20200356528A1 (en) System and method for archive file check-in/out in an enterprise content management system
WO2022082891A1 (en) Big data acquisition method and system, and computer device and storage medium thereof
Upadhyay et al. Deduplication and compression techniques in cloud design
CN103823807A (en) Data de-duplication method, device and system
JP2022520654A (en) Data archiving methods and systems that utilize hybrid data storage
CN113486026A (en) Data processing method, device, equipment and medium
WO2009097710A1 (en) Method for organizing and retrieving files, module and system for organizing files and storage media thereof
CN104035943A (en) Data storage method and corresponding server
CN102737082A (en) Method and system for dynamically updating file data indexes
CN110913017A (en) File compression transmission method based on cloud desktop
Kaur et al. Image processing on multinode hadoop cluster

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase (Ref document number: 201280000916.6; Country of ref document: CN)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 12881838; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 12881838; Country of ref document: EP; Kind code of ref document: A1)