CN106569896B - Data distribution and parallel processing method and system - Google Patents

Data distribution and parallel processing method and system

Info

Publication number
CN106569896B
CN106569896B
Authority
CN
China
Prior art keywords
query
data
node
real
nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610956986.9A
Other languages
Chinese (zh)
Other versions
CN106569896A (en)
Inventor
孙乔
付兰梅
邓卜侨
孙雷
陆旭
刘炜
崔伟
高阿朋
裴旭斌
吴舜
聂晓波
吴芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Great Opensource Software Co ltd
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
State Grid Jibei Electric Power Co Ltd
Beijing Guodiantong Network Technology Co Ltd
Beijing China Power Information Technology Co Ltd
State Grid Eastern Inner Mongolia Power Co Ltd
Beijing Zhongdian Feihua Communication Co Ltd
Original Assignee
Beijing Great Opensource Software Co Ltd
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
State Grid Zhejiang Electric Power Co Ltd
State Grid Jibei Electric Power Co Ltd
Beijing Guodiantong Network Technology Co Ltd
Beijing Fibrlink Communications Co Ltd
State Grid Eastern Inner Mongolia Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Great Opensource Software Co Ltd, State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, State Grid Zhejiang Electric Power Co Ltd, State Grid Jibei Electric Power Co Ltd, Beijing Guodiantong Network Technology Co Ltd, Beijing Fibrlink Communications Co Ltd, State Grid Eastern Inner Mongolia Power Co Ltd filed Critical Beijing Great Opensource Software Co Ltd
Priority to CN201610956986.9A priority Critical patent/CN106569896B/en
Publication of CN106569896A publication Critical patent/CN106569896A/en
Application granted granted Critical
Publication of CN106569896B publication Critical patent/CN106569896B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/5017Task decomposition

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data distribution and parallel processing method and system. The method comprises: parsing an initial query instruction to obtain a real query instruction; distributing the real query instruction to different data nodes through a work distribution node; executing the received real query instruction on each data node and returning the query results to the work distribution node, while keeping the data information table updated on each server node; backing up, on each server node, the query table corresponding to the real query instruction, so that a query table copy is stored on each server node; and, on each server node, when a concurrency conflict is detected, concurrently executing the real query instruction by calling the query table copy on the current server node. The data distribution and parallel processing method and system can not only guarantee data consistency but also improve the speed and efficiency of data query.

Description

Data distribution and parallel processing method and system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method and a system for data distribution and parallel processing.
Background
A parallel processing method is a processing mode that employs parallel means during information processing in order to compute efficiently. In other words, multiple jobs are completed within the same time frame; the jobs may be the same or different in nature. As long as they overlap in time, the processing may be called parallel. Parallel processing is mainly applied in fields such as high-performance processors, large-scale database management, and complex mathematical modeling, and its range of application continues to expand. It comprises three elements: simultaneity, concurrency, and pipelining. To achieve high performance and meet large computational demands, a parallel computing model can be adopted for data processing.
Big data analytics is becoming more prevalent in many business areas, including financial enterprises, government agencies, and insurance agencies. Applications range from generating simple reports to performing complex analytical workloads. The difficulty lies in how to develop analytical queries and store the data in these areas as the amount of data increases. In general, data Consistency, Availability, and Partition tolerance need to be considered when designing a distributed system. Data consistency means that if data is copied to multiple servers, all servers need to be updated when the data is updated; availability refers to responding to user demand in a timely manner; partition tolerance concerns the scalability of the nodes. A Massively Parallel Processing (MPP) architecture can address these challenges by distributing storage and queries across multiple nodes and processes. MPP nodes are connected through an interconnection network, so a query from a terminal can be processed by a cluster of low-cost servers. In the MPP architecture, each module has its own CPU, storage, etc. to manage its data, and server nodes can be dynamically added or deleted. Thus, data and computing capacity are not shared between modules, and data is distributed to local storage.
Specifically, an MPP is formed by connecting a plurality of SMP (Symmetric Multi-Processor) servers through a node interconnection network; the servers cooperate to complete the same task. Its basic characteristic is that the system is composed of multiple SMP servers, each called a node or server node, connected through the node interconnection network; each node accesses only its own local resources, such as memory and storage, making it a completely shared-nothing architecture. MPP therefore has good expansion capability, and in theory the expansion is unlimited. Referring to fig. 4, a schematic structural diagram of an MPP architecture is shown.
In an MPP system, the CPU within one node cannot access the memory of another node. Information interaction between nodes is realized through the node interconnection network, a process generally called data reallocation. An MPP server requires a complex mechanism to schedule and balance the load of the various nodes and the parallel processing process. Some current MPP-based servers tend to mask this complexity with system-level software (e.g., databases). An MPP database distributes tasks to a plurality of servers and nodes in parallel, and after each node completes its calculation, the respective results are collected to obtain the final result. Although MPP can connect CPUs at a larger scale, its shared-nothing characteristic makes it difficult to maintain data consistency. Because the MPP system does not share resources, it occupies more resources than an SMP system, and when communication time dominates, the MPP system cannot fully exploit its resources to achieve high efficiency.
Therefore, the inventor finds that the prior art has at least the following problems in the process of implementing the invention: the consistency of the data in the database is poor, and the query speed and efficiency of the data are low.
Disclosure of Invention
In view of this, the present invention provides a method and a system for data distribution and parallel processing, which can not only ensure the consistency of data, but also improve the speed and efficiency of data query.
The invention provides a data distribution and parallel processing method based on the above purpose, which comprises the following steps:
receiving an initial query instruction input by a user side;
analyzing the initial query instruction to obtain a real query statement which meets the standard and is used as a real query instruction;
sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes which are divided by all server nodes in the database according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
executing the received real query instruction on each data node respectively, returning a query result to the work distribution node, and keeping the update of the data information table on each server node;
on each server node, backing up a query table corresponding to the real query instruction, so that a query table copy is stored on each server node;
on each server node, when a concurrency conflict is detected, the real query instruction is executed concurrently by calling the copy of the query table on the current server node.
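The steps above can be sketched end to end. The following is a minimal illustration under assumed names (the `DataNode` / `WorkDistributionNode` classes, the predicate-style query, and the in-memory query-table copy are illustrative, not from the patent):

```python
# Hedged sketch: a work distribution node fans a "real" query out to data
# nodes and merges the returned partial results, while each data node keeps
# a local query-table copy recording query results.
from concurrent.futures import ThreadPoolExecutor

class DataNode:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows              # local shard of the data
        self.query_table_copy = {}    # per-node copy of the query table

    def execute(self, real_query):
        # Record the query result in the local query-table copy.
        result = [r for r in self.rows if real_query(r)]
        self.query_table_copy[id(real_query)] = result
        return result

class WorkDistributionNode:
    def __init__(self, data_nodes):
        self.data_nodes = data_nodes

    def run(self, real_query):
        # Distribute the real query to every data node and merge the results.
        with ThreadPoolExecutor() as pool:
            partials = pool.map(lambda n: n.execute(real_query), self.data_nodes)
        return [row for part in partials for row in part]

nodes = [DataNode("dn1", [1, 4, 7]), DataNode("dn2", [2, 5, 8])]
coordinator = WorkDistributionNode(nodes)
print(sorted(coordinator.run(lambda r: r > 4)))  # merged query result
```

The merged result is assembled at the work distribution node, matching the role the method assigns to it.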
Optionally, when it is detected that there is a concurrency conflict, the step of concurrently executing the real query instruction by calling the query table copy on the current server node includes:
detecting, in each server node, whether the real query instruction has a concurrency conflict;
if the concurrency conflict does not exist, submitting the query task; the query task is a query task corresponding to a real query instruction;
if the concurrency conflict exists, calling the self-adaptive backup on the current server node to be compared with the current query task, and executing the query task on the self-adaptive backup;
and if the query tasks in all the server nodes are successfully executed, returning a decision for submitting the query tasks, and otherwise, returning a decision for withdrawing the query tasks.
Optionally, the method for dividing all server nodes in the database into work distribution nodes and data nodes according to preset functions includes:
detecting the local storage capacity and the computing processing capacity of each server node;
judging whether the calculation processing capacity of the current server node is greater than a preset performance threshold value, if so, setting the current server node as a work distribution node, otherwise, setting the current server node as a data node;
or,
judging whether the local storage capacity of the current server node is greater than a preset storage threshold value or not, if so, setting the current server node as a data node; otherwise, the current server node is set as the working distribution node.
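The two alternative division rules above can be written as one small function. The threshold values and node-attribute names below are assumptions for illustration:

```python
# Hedged sketch of the role-division rule: a node with strong compute becomes
# a work distribution node (rule 1), or a node with large local storage
# becomes a data node (rule 2). Thresholds are illustrative.
def assign_role(node, perf_threshold=100.0, storage_threshold=500.0,
                by_compute=True):
    """Classify a server node as 'work_distribution' or 'data'."""
    if by_compute:
        # Rule 1: compute capacity above the performance threshold.
        return "work_distribution" if node["compute"] > perf_threshold else "data"
    # Rule 2: local storage capacity above the storage threshold.
    return "data" if node["storage"] > storage_threshold else "work_distribution"

print(assign_role({"compute": 180.0, "storage": 200.0}))                   # work_distribution
print(assign_role({"compute": 80.0, "storage": 900.0}, by_compute=False))  # data
```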
Optionally, the step of sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node further includes:
detecting the specific use condition of each server node;
and distributing data nodes and corresponding spaces on each server node by adopting a preset balance strategy according to the detection result.
Optionally, the balancing policy is a lowest-space-utilization policy.
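A lowest-space-utilization policy picks, for each new allocation, the server whose storage is proportionally least used. A minimal sketch (the usage-map shape is an assumption):

```python
# Hedged sketch of the lowest-space-utilization balancing policy.
def pick_node(usage):
    """usage: {node_name: (used_bytes, capacity_bytes)} -> least-utilized node."""
    return min(usage, key=lambda n: usage[n][0] / usage[n][1])

usage = {"s1": (80, 100), "s2": (30, 100), "s3": (45, 90)}
print(pick_node(usage))  # s2 (30% used)
```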
Optionally, the step of maintaining the update of the data information table on each server node further includes:
and copying and storing the position information and the related operation information of the data information table.
Optionally, the step of maintaining the update of the data information table on each server node further includes:
all data in each data node are backed up to the designated data node;
keeping all data in each data node and synchronous updating of corresponding data backup;
the location information of each data node is stored on each data node.
Optionally, the step of parsing the initial query instruction further includes:
performing lexical analysis on the characters input by the user side to obtain words meeting the standard;
performing syntactic analysis on a plurality of consecutive words to obtain statements conforming to syntactic logic, while constructing an abstract syntax tree;
and converting the logical SQL statement into a real SQL statement according to the abstract syntax tree, to obtain a real query statement meeting the standard as the real query instruction.
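The lex-then-parse pipeline can be illustrated with a deliberately tiny grammar. Real SQL parsing is far richer; the token pattern and the SELECT-only grammar below are assumptions for illustration:

```python
# Toy sketch of the parse stage: lexical analysis splits the input into
# words, syntactic analysis checks a minimal SELECT grammar, and the result
# is a small abstract syntax tree built from nested tuples.
import re

def lex(text):
    # Lexical analysis: convert the character string into tokens.
    return re.findall(r"[A-Za-z_][A-Za-z_0-9]*|\*|,|=|\d+|'[^']*'", text)

def parse(tokens):
    # Syntactic analysis: accept only SELECT <cols> FROM <table>.
    uppers = [t.upper() for t in tokens]
    if not tokens or uppers[0] != "SELECT" or "FROM" not in uppers:
        raise SyntaxError("statement does not conform to syntactic logic")
    i = uppers.index("FROM")
    columns = [t for t in tokens[1:i] if t != ","]
    return ("select", ("columns", columns), ("table", tokens[i + 1]))

ast = parse(lex("SELECT id , name FROM users"))
print(ast)  # ('select', ('columns', ['id', 'name']), ('table', 'users'))
```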
The invention also provides a data distribution and parallel processing system, which comprises:
the input unit is used for receiving an initial query instruction input by a user side;
the analysis unit is used for analyzing the initial query instruction to obtain a real query statement meeting the standard and using the real query statement as a real query instruction;
the distribution unit is used for sending the real query instruction to a work distribution node and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes which are divided by all server nodes in the database according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
the processing unit is used for respectively executing the received real query instruction on each data node, returning a query result to the work distribution node, and simultaneously keeping the update of the data information table on each server node;
the backup unit is used for backing up the query table corresponding to the real query instruction on each server node, so that a query table copy is stored on each server node;
and the concurrency unit is used for executing the real query instruction in a concurrent manner by calling the copy of the query table on the current server node when the concurrency conflict is detected on each server node.
As can be seen from the above, according to the data distribution and parallel processing method and system provided by the invention, all server nodes in the database are divided into work distribution nodes and data nodes, so that the work distribution nodes can manage task distribution for the data nodes and the return of query results; in this way, the data nodes can achieve load balancing according to their different usage states. Consistent synchronization of data among the server nodes can be ensured by keeping the data information table updated. By backing up a copy of the query table on each server node, each server node can execute query instructions concurrently, which shortens the response time and improves query efficiency. By calling the adaptive backup and comparing it with the current query task when a concurrency conflict occurs, the consistency of the distributed data can be ensured. Therefore, the data distribution and parallel processing method and system not only ensure data consistency but also improve the speed and efficiency of data query.
Drawings
FIG. 1 is a flow chart of one embodiment of a data distribution and parallel processing method provided by the present invention;
FIG. 2 is a flow chart of the method for data distribution and parallel processing according to the present invention;
FIG. 3 is a flow diagram of one embodiment of a data distribution and parallel processing system provided by the present invention;
fig. 4 is a schematic diagram of an MPP structure provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that, in the embodiments of the present invention, the expressions "first" and "second" are used to distinguish two entities or parameters that share the same name but are not the same. "First" and "second" are merely for convenience of description and should not be construed as limiting the embodiments of the present invention; the following embodiments do not repeat this point.
Referring to fig. 1, a flowchart of an embodiment of a data distribution and parallel processing method provided by the present invention is shown. The data distribution and parallel processing method comprises the following steps:
step 101, receiving an initial query instruction input by a user side; the user usually inputs some continuous character strings at the user end, and the database usually needs to be parsed to obtain recognizable sentence or instruction information.
Step 102, analyzing the initial query instruction to obtain a real query statement meeting the standard, and taking the real query statement as a real query instruction. Common query analysis mainly comprises steps such as lexical analysis, syntactic analysis, semantic analysis, and statement analysis. The effect of parsing is to convert an input character string into a structure that describes the character string, so that the computer can more easily understand what the user's input means. This stage includes three processes: lexical analysis, syntactic analysis, and output of an abstract syntax tree. The lexical analyzer is a deterministic finite automaton (DFA) that converts the input character set into words according to predefined lexical rules. Lexical analysis is followed by syntactic analysis; the result of lexical analysis serves as the input of syntactic analysis, which determines, on the basis of the lexical analysis, whether the words input by the user conform to syntactic logic. Taking SQL as an example, semantic analysis involves the relevant theories and concepts of the SQL standard and SQL optimization, and includes logical analysis and physical analysis: logical analysis is a general mathematical analysis process, independent of the underlying distributed environment, while physical analysis transforms the result of the logical analysis and is closely related to the underlying execution environment.
An Abstract Syntax Tree (AST) is a tree-structured representation of the statement input by the user: each node in the tree is a word, and the structure of the tree embodies the syntax. The abstract syntax tree is constructed during syntactic analysis; after syntactic analysis finishes, the syntactic analyzer outputs an abstract syntax tree whose structural content corresponds to the user's input.
Step 103, sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes into which all server nodes in the database are divided according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes. In the MPP architecture, the whole database includes a large number of relatively independent servers, each serving as an independent node; such a single server may be referred to as a server node or simply a server, and the terms denote the same thing. Generally, one work distribution node can be arranged in a database to manage the remaining server nodes, i.e., the data nodes; alternatively, several work distribution nodes may be provided, each managing the data nodes that belong to it. By having work distribution nodes manage data nodes in this way, factors such as the storage amount and access pressure of each server node can be approximately equalized; for example, data in data nodes with high load can be migrated to data nodes with low load, achieving load balancing.
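The migration-based load balancing mentioned here can be sketched as a simple loop that repeatedly moves one unit of data from the most loaded node to the least loaded one. The unit-at-a-time rule and the tolerance parameter are assumptions:

```python
# Hedged sketch: migrate load from high-load data nodes to low-load ones
# until the spread is within a tolerance.
def rebalance(loads, tolerance=1):
    """loads: {node: load}. Returns a balanced copy of the load map."""
    loads = dict(loads)
    while max(loads.values()) - min(loads.values()) > tolerance:
        hi = max(loads, key=loads.get)   # most loaded node
        lo = min(loads, key=loads.get)   # least loaded node
        loads[hi] -= 1                   # migrate one unit of data
        loads[lo] += 1
    return loads

print(rebalance({"dn1": 10, "dn2": 2, "dn3": 6}))  # {'dn1': 6, 'dn2': 6, 'dn3': 6}
```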
Step 104, executing the received real query instruction on each data node, returning the query result to the work distribution node, and keeping the update of the data information table on each server node; the data information table is used for recording the update of data and maintaining the consistency of the data on each server node through the communication between the server nodes; that is, any modification made to the data is recorded in the data information table, and maintaining the update of the data information table as described herein means that the modification is synchronized on the backup or copy thereof, so that the data information table can be guaranteed to be consistent.
Step 105, backing up the query table corresponding to the real query instruction on each server node, so that a query table copy is stored on each server node. The query table is a data table that stores the query results obtained after the real query instruction (for example, a select [id] instruction) is executed; all query instructions are also recorded in the query table. In a distributed database, a query task is a sequence of operations: any distributed operation ultimately becomes a sequence of operations on the database storage, and the execution of that sequence is itself distributed. Therefore, the query table is backed up on each server node, so that every server node can execute the query instruction through the query table or its copy; this enables concurrent execution of query instructions, and maintaining the query table copies improves query efficiency.
Step 106, on each server node, when a concurrency conflict is detected, concurrently executing the real query instruction by calling the query table copy on the current server node. In the prior art, when a concurrency conflict occurs, the query can proceed only after waiting for a global response, so the data query speed and efficiency are low and data consistency is poor. According to the invention, the query table copy is stored in advance, and the query task is then executed by calling the query table copy when a concurrency conflict is encountered; this realizes concurrent execution of query statements, greatly shortens the response time, and further improves the speed and efficiency of data query.
According to the data distribution and parallel processing method provided by the embodiment of the invention, all server nodes in the database are divided into work distribution nodes and data nodes, so that the work distribution nodes can manage task distribution for the data nodes and the return of query results; in this way, the data nodes can achieve load balancing according to their different usage states. Consistent synchronization of data among the server nodes can be ensured by keeping the data information table updated. By backing up a copy of the query table on each server node, each server node can execute query instructions concurrently, which shortens the response time and improves query efficiency. By calling the adaptive backup and comparing it with the current query task when a concurrency conflict occurs, the consistency of the distributed data can be ensured. Therefore, the data distribution and parallel processing method provided by the invention not only ensures data consistency but also improves the speed and efficiency of data query. Meanwhile, based on the above steps, the invention also overcomes the problem of concurrency conflicts during queries, making the whole query process more stable and reliable.
In some alternative embodiments of the present invention, the database first needs to be divided into several subsets according to preset rules, for example: uniformly dividing all data in the database into n disjoint equal parts by columns, each part consisting of several columns; or dividing all data into n disjoint equal parts by rows, each part consisting of several rows; or dividing the data into n disjoint subsets according to data similarity, for example, data larger than the average value is placed in one subset and the other data in another subset; or dividing the data into n disjoint subsets according to data difference, for example, data in columns whose variance is larger than a certain constant k is placed in one subset and the other data in another subset. In this way the entire database is divided into several mutually independent subsets, each subset being processed on one server, for a total of n servers. This enables an independent decision to be made on each server as to whether the current task should be aborted or submitted. The principle of judgment is as follows: the query task may be submitted when there is no concurrency conflict; if there is a concurrency conflict (for example, a file being written by one program cannot be written by another program at the same time), the adaptive copy backup of the subset on the current server is called, compared and verified against the current task, and a decision is given.
Adaptive backup means that, for disaster tolerance and fault tolerance, important data is backed up on several servers in the distributed system. There are many methods of adaptive copy backup, including data-block backup, dynamic backup, and so on. If the data on multiple servers in a distributed deployment is required to be consistent, the write operations on all servers must be atomic: either all write operations are executed or none are. However, one server, while executing its local transaction, cannot know the execution result of the local transactions on the other servers, so it cannot know whether the transaction should be aborted or committed. If there is a concurrency conflict on a subset, i.e., it is occupied by the write operation of one program and other programs cannot execute write operations on it, the present invention proposes to execute the local transaction by accessing the backup of the current subset instead, so that the local transaction execution results of all servers can be obtained and a decision given. The specific method is as follows: if the transaction operations on all participating subsets of the task execute successfully, a decision to submit the query task is returned; if the transaction operation on any participating subset fails to execute, a decision to revoke the query task is returned. By contrast, MPP-based shared-nothing architectures normally require global deadlock detection mechanisms, two-phase commit protocols, and the like to ensure the integrity and recovery of transactions. The scheme adopted by the invention does not need these time-consuming methods, so the query task is faster.
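The commit/revoke decision rule can be sketched as follows. The participant representation and the boolean success values are assumptions; the point is only the all-or-nothing vote, with locked subsets served from their adaptive backups:

```python
# Hedged sketch of the decision rule: each participant runs the query task
# on its subset, or on the subset's adaptive backup when the subset is
# locked by a concurrent writer; the task commits only if all succeed.
def decide(participants):
    """participants: dicts with 'locked' flag and run functions per copy."""
    results = []
    for p in participants:
        run = p["run_on_backup"] if p["locked"] else p["run_on_subset"]
        results.append(run())
    return "commit" if all(results) else "rollback"

ok = lambda: True
fail = lambda: False
print(decide([
    {"locked": False, "run_on_subset": ok,   "run_on_backup": ok},
    {"locked": True,  "run_on_subset": fail, "run_on_backup": ok},  # conflict: uses backup
]))  # commit
```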
In some optional embodiments of the present invention, as shown in fig. 2, step 106 of concurrently executing the real query instruction by calling the query table copy on the current server node when a concurrency conflict is detected further includes:
step 201, detecting the real query instruction in each server node, determining whether a concurrency conflict exists, if not, executing step 202, and if so, executing step 203;
step 202, according to step 201, if there is no concurrency conflict, submitting a query task; the query task is a query task corresponding to a real query instruction;
step 203, according to step 201, if there is a concurrency conflict, calling a self-adaptive backup on the current server node to compare with the current query task, and executing the query task on the self-adaptive backup;
step 204, judging whether the query tasks in all the server nodes are successfully executed, if so, executing step 205, otherwise, executing step 206;
step 205, according to step 204, if the query tasks in all the server nodes are successfully executed, returning to the decision of submitting the query tasks;
step 206, according to step 204, if there is an unsuccessful query task, a decision to withdraw the query task is returned.
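Steps 201-203 can be sketched with a per-node lock standing in for the conflict signal (the lock-based detection, the data layout, and the class names are assumptions for illustration; steps 204-206 are the global all-or-nothing vote):

```python
# Hedged sketch of steps 201-203: an uncontended node executes the query
# task directly (step 202); a contended node runs it against its adaptive
# backup instead (step 203).
import threading

class ServerNode:
    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()    # held => concurrency conflict
        self.data = {"id": 42}          # local subset of the data
        self.backup = dict(self.data)   # adaptive backup of that subset

    def run_query(self, query):
        # Step 201: detect whether a concurrency conflict exists.
        if self.lock.acquire(blocking=False):
            try:
                return ("direct", query(self.data))       # step 202
            finally:
                self.lock.release()
        # Step 203: conflict => run the task on the adaptive backup.
        return ("via_backup", query(self.backup))

node = ServerNode("s1")
node.lock.acquire()                       # simulate a conflicting writer
print(node.run_query(lambda d: d["id"]))  # ('via_backup', 42)
```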
Thus, for the problem of concurrency conflicts within a server node, the query task can be executed by calling the adaptive backup. Specifically, each server holds adaptive-backup record information. When a query instruction encounters a concurrency conflict and the adaptive backup needs to be called, the current server determines the storage location of the backup, i.e., which server node stores the adaptive backup, by accessing the adaptive-backup record on the current server; it then compares the backup with the current query task and gives a decision. In this way the query task is executed on the adaptive backup data of the current server, data consistency is maintained, the whole query task can be executed concurrently and effectively, and the speed and efficiency of data query are greatly improved.
In some optional embodiments of the present invention, the method for dividing all server nodes in the database into the work distribution node and the data node according to a preset function includes:
detecting the local storage capacity and the computing capacity of each server node;
judging whether the computing capacity of the current server node is greater than a preset performance threshold; if so, setting the current server node as a work distribution node, otherwise setting it as a data node;
or,
judging whether the local storage capacity of the current server node is greater than a preset storage threshold; if so, setting the current server node as a data node, otherwise setting it as a work distribution node.
In this way, each server node can realize a different function according to its own characteristics, so that server resources are utilized to the maximum extent. At the same time, this division makes the management of the database more orderly and efficient.
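As a rough illustration of this role-division rule, the following sketch classifies a node by either criterion. The threshold values, units, and function names are assumptions for demonstration only.

```python
PERF_THRESHOLD = 8.0      # preset performance threshold (arbitrary units)
STORAGE_THRESHOLD = 500   # preset storage threshold (assumed GB)

def assign_role_by_compute(compute_capacity):
    """Nodes above the performance threshold become work distribution
    nodes; the rest serve as data nodes."""
    return "work_distribution" if compute_capacity > PERF_THRESHOLD else "data"

def assign_role_by_storage(local_storage_gb):
    """The alternative rule: large local storage marks a data node."""
    return "data" if local_storage_gb > STORAGE_THRESHOLD else "work_distribution"
```

Either rule partitions the cluster into the two roles; which criterion applies would depend on the deployment.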
In some optional embodiments of the present invention, the step of sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node further includes:
detecting the specific usage of each server node;
and distributing data nodes and corresponding spaces on each server node by adopting a preset balancing strategy according to the detection result.
The specific usage includes the communication status of the server, the storage-space utilization rate of each data node, and other conditions or data related to server-node performance. In this way, query tasks can be distributed differently according to the different conditions of the server nodes, optimizing the processing speed and efficiency of the tasks.
Further, in some optional embodiments, the balancing policy is a lowest space usage policy.
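A lowest-space-usage policy can be sketched minimally as below; the dictionary layout and utilization figures are illustrative assumptions, not a format taken from the patent.

```python
def pick_data_node(usage_by_node):
    """Return the data node with the lowest storage-space utilization,
    so the next allocation lands on the least-loaded node."""
    return min(usage_by_node, key=usage_by_node.get)

# Hypothetical utilization snapshot: node name -> fraction of space used
usage = {"node_1": 0.7, "node_2": 0.2, "node_3": 0.5}
```

Here `pick_data_node(usage)` selects `node_2`, since it currently uses the smallest fraction of its storage.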
In some optional embodiments of the invention, the step of maintaining the update of the data information table on each server node further comprises: copying and storing the position information and the related operation information of the data information table. Saving the position information and the related operation information of the data information table in this way records every data-processing action, so that the user can, for example, withdraw an operation based on this information.
In some optional embodiments of the invention, the step of maintaining the update of the data information table at each server node further comprises:
backing up all data in each data node to a designated data node; that is, the backup data ensures the safety of the data and yields the self-adaptive backup data.
keeping all data in each data node synchronized with the corresponding data backup; through synchronous updating of the saved data, all data in each data node is kept consistent with the backed-up data.
storing the position information of each data node on every data node, so that a data node can be found through its position information, which makes withdrawal operations convenient to realize.
The position information of each data node may be maintained in the database by updating the primary database copy that specifies the database address. All information about the original data can be stored in the data nodes; based on this information, the backup data can be retrieved from the corresponding data node or backup data node.
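One way to picture the data information table is as a mapping from each data node to the node holding its self-adaptive backup, so a failed lookup can be redirected. The table layout, node names, and version field below are invented for illustration.

```python
# Hypothetical data information table: node -> (backup node, synced version)
data_info_table = {
    "node_a": ("node_c", 42),
    "node_b": ("node_a", 42),
}

def backup_node_for(node):
    """Locate the server holding a node's self-adaptive backup by
    consulting the data information table."""
    return data_info_table[node][0]
```

With the table replicated on every server node, any node can resolve `backup_node_for("node_a")` locally and redirect the query there.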
In some optional embodiments of the present invention, the step of parsing the initial query instruction further includes:
performing lexical analysis on the characters input by the user side to obtain words that meet the standard;
performing syntactic analysis on a plurality of consecutive words to obtain sentences that conform to syntactic logic, while constructing an abstract syntax tree;
and converting the logical SQL statement into a real SQL statement according to the abstract syntax tree, obtaining a standard-conforming real query statement that serves as the real query instruction.
Therefore, the real query instruction can be obtained through the gradual analysis of the initial query instruction, and the subsequent data query task is further executed.
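A toy version of the first and last parsing stages might look like the sketch below: tokenize the input, then rewrite a logical table name to a "real" one. The token grammar and the logical-to-real mapping are assumptions; a production parser would build a full abstract syntax tree between these two steps.

```python
import re

# Hypothetical mapping from logical table names to real (e.g. sharded) ones
LOGICAL_TO_REAL = {"orders": "orders_shard_01"}

def lex(text):
    """Lexical analysis: split the input into standard-conforming tokens
    (identifiers, '*', '=', and integer literals in this toy grammar)."""
    return re.findall(r"[A-Za-z_]\w*|\*|=|\d+", text)

def to_real_sql(tokens):
    """Rewrite logical identifiers to real ones, yielding the 'real'
    query statement used as the real query instruction."""
    return " ".join(LOGICAL_TO_REAL.get(t, t) for t in tokens)
```

For example, `to_real_sql(lex("SELECT * FROM orders"))` rewrites the logical table `orders` into the assumed shard name.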
Optionally, in the data distribution and parallel processing method of the present invention, a copy of the query table is pre-stored on each server node. When a concurrency conflict prevents the current server node from executing a query instruction, the node returns a cannot-query result to the processing node. The processing node then locates, through the data information table, the backup server holding the self-adaptive backup of the current server's data, and instructs that backup server to query the backup data according to the query instruction in the query-table copy, thereby completing the data query.
Referring to fig. 3, a flowchart of an embodiment of a data distribution and parallel processing system according to the present invention is shown. The data distribution and parallel processing system comprises:
an input unit 301, configured to receive an initial query instruction input by a user end;
an analyzing unit 302, configured to analyze the initial query instruction to obtain a real query statement meeting a standard, and use the real query statement as a real query instruction;
the distributing unit 303 is configured to send the real query instruction to a work distribution node, and distribute the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes which are divided by all server nodes in the database according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
the processing unit 304 is configured to execute the received real query instruction on each data node, return a query result to the work distribution node, and keep updating the data information table on each server node;
a backup unit 305, configured to back up, on each server node, the query table corresponding to the real query instruction, so that a query table copy is stored on each server node;
and a concurrency unit 306, configured to, when a concurrency conflict is detected on a server node, execute the real query instruction concurrently by calling the query table copy on the current server node.
As can be seen from the foregoing embodiments, the data distribution and parallel processing system of the present invention backs up the query table through the backup unit 305 and then processes query tasks concurrently through the concurrency unit 306, so that the system not only ensures data consistency but also improves data query speed and efficiency.
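An end-to-end wiring of the units 301 to 306 might be sketched as follows. All class and method names are invented for illustration; the patent describes the units functionally, not as concrete classes.

```python
class QuerySystem:
    """Hypothetical composition of the system's units: an input/parsing
    stage (301-302), a distribution stage (303), and execution on data
    nodes (304). Backup and concurrency units are omitted for brevity."""

    def __init__(self, parser, distributor, nodes):
        self.parser = parser            # analyzing unit 302
        self.distributor = distributor  # distributing unit 303
        self.nodes = nodes              # data nodes driven by unit 304

    def handle(self, raw_query):
        real_query = self.parser(raw_query)              # 301 -> 302
        plan = self.distributor(real_query, self.nodes)  # 303: pick nodes
        return [node(real_query) for node in plan]       # 304: execute
```

For instance, with a trivial parser and a distributor that selects every node, `QuerySystem(str.upper, lambda q, ns: ns, [len]).handle("abc")` runs the uppercased query on the single stand-in node.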
Those of ordinary skill in the art will understand: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the invention, features in the above embodiments or in different embodiments may also be combined, steps may be implemented in any order, and many other variations of the different aspects of the invention exist as described above, which are not provided in detail for the sake of brevity.
In addition, well known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown within the provided figures for simplicity of illustration and discussion, and so as not to obscure the invention. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the invention, and also in view of the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the present invention is to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the invention, it should be apparent to one skilled in the art that the invention can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements and the like that may be made without departing from the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (9)

1. A data distribution and parallel processing method is characterized by comprising the following steps:
receiving an initial query instruction input by a user side;
analyzing the initial query instruction to obtain a real query statement which meets the standard and is used as a real query instruction;
sending the real query instruction to a work distribution node, and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes which are divided by all server nodes in the database according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
executing the received real query instruction on each data node respectively, returning a query result to the work distribution node, and keeping the update of the data information table on each server node; the data information table is used for recording the update of data;
on each server node, backing up a query table corresponding to the real query instruction, so that a query table copy is stored on each server node; the query table is a data table for storing the obtained query result after the real query instruction is subjected to instruction query;
on each server node, when a concurrency conflict is detected, the real query instruction is executed concurrently by calling the copy of the query table on the current server node.
2. The method of claim 1, wherein the step of concurrently executing the real query instruction by invoking the copy of the lookup table on the current server node when the presence of the concurrency conflict is detected comprises:
detecting whether the real query instruction has a concurrency conflict in each server node;
if no concurrency conflict exists, submitting the query task; the query task is the query task corresponding to the real query instruction;
if a concurrency conflict exists, calling the query table backup on the current server node, comparing it with the current query task, and executing the query task on the query table backup;
and if the query tasks in all the server nodes are successfully executed, returning a decision to submit the query tasks; otherwise, returning a decision to withdraw the query tasks.
3. The method of claim 1, wherein the step of dividing all server nodes in the database into work distribution nodes and data nodes according to a preset function comprises:
detecting the local storage capacity and the computing capacity of each server node;
judging whether the computing capacity of the current server node is greater than a preset performance threshold; if so, setting the current server node as a work distribution node, otherwise setting it as a data node;
or,
judging whether the local storage capacity of the current server node is greater than a preset storage threshold; if so, setting the current server node as a data node, otherwise setting it as a work distribution node.
4. The method of claim 1, wherein the step of sending the real query instructions to a work distribution node and distributing the real query instructions to different data nodes through the work distribution node further comprises:
detecting the specific usage of each server node;
and distributing data nodes and corresponding spaces on each server node by adopting a preset balance strategy according to the detection result.
5. The method of claim 4, wherein the balancing policy is a least space usage policy.
6. The method of claim 1, wherein the step of maintaining an update of the data information table at each server node further comprises:
and copying and storing the position information and the related operation information of the data information table.
7. The method of claim 1, wherein the step of maintaining an update of the data information table at each server node further comprises:
all data in each data node are backed up to the designated data node;
keeping all data in each data node and synchronous updating of corresponding data backup;
the location information of each data node is stored on each data node.
8. The method of claim 1, wherein the step of parsing the initial query instruction further comprises:
performing lexical analysis on the characters input by the user side to obtain words that meet the standard;
performing syntactic analysis on a plurality of consecutive words to obtain sentences that conform to syntactic logic, while constructing an abstract syntax tree;
and converting the logical SQL statement into a real SQL statement according to the abstract syntax tree, obtaining a standard-conforming real query statement that serves as the real query instruction.
9. A data distribution and parallel processing system, comprising:
the input unit is used for receiving an initial query instruction input by a user side;
the analysis unit is used for analyzing the initial query instruction to obtain a real query statement meeting the standard and using the real query statement as a real query instruction;
the distribution unit is used for sending the real query instruction to a work distribution node and distributing the real query instruction to different data nodes through the work distribution node; the work distribution nodes and the data nodes are nodes which are divided by all server nodes in the database according to preset functions; the work distribution node is used for managing task distribution and query results of the data nodes;
the processing unit is used for respectively executing the received real query instruction on each data node, returning a query result to the work distribution node, and simultaneously keeping the update of the data information table on each server node; the data information table is used for recording the update of data;
the backup unit is used for backing up the query table corresponding to the real query instruction on each server node, so that a query table copy is stored on each server node; the query table is a data table for storing the obtained query result after the real query instruction is subjected to instruction query;
and the concurrency unit is used for executing the real query instruction in a concurrent manner by calling the copy of the query table on the current server node when the concurrency conflict is detected on each server node.
CN201610956986.9A 2016-10-25 2016-10-25 A kind of data distribution and method for parallel processing and system Active CN106569896B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610956986.9A CN106569896B (en) 2016-10-25 2016-10-25 A kind of data distribution and method for parallel processing and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610956986.9A CN106569896B (en) 2016-10-25 2016-10-25 A kind of data distribution and method for parallel processing and system

Publications (2)

Publication Number Publication Date
CN106569896A CN106569896A (en) 2017-04-19
CN106569896B true CN106569896B (en) 2019-02-05

Family

ID=58535786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610956986.9A Active CN106569896B (en) 2016-10-25 2016-10-25 A kind of data distribution and method for parallel processing and system

Country Status (1)

Country Link
CN (1) CN106569896B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107369088B (en) * 2017-07-19 2021-05-07 中国银行股份有限公司 Method and device for processing account transaction
CN108228752B (en) * 2017-12-21 2022-04-15 中国联合网络通信集团有限公司 Data total export method, data export task allocation device and data export node device
CN109684050B (en) * 2018-12-06 2021-02-19 成都佰纳瑞信息技术有限公司 Application method of parallelization transaction executor
CN110928550B (en) * 2019-11-19 2023-11-24 上海工程技术大学 Method for eliminating GCC abstract syntax tree redundancy based on keyword Trie tree
CN113360540A (en) * 2020-03-04 2021-09-07 上海分泽时代软件技术有限公司 Method for inquiring and distributing outdoor time-sharing advertisement inventory in real time
CN112346879B (en) * 2020-11-06 2023-08-11 网易(杭州)网络有限公司 Process management method, device, computer equipment and storage medium
CN117251294B (en) * 2023-11-14 2024-03-08 广东仁达科技有限公司 Data real-time distribution multichannel parallel system based on database replica technology

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102201010A (en) * 2011-06-23 2011-09-28 清华大学 Distributed database system without sharing structure and realizing method thereof
CN102663117A (en) * 2012-04-18 2012-09-12 中国人民大学 OLAP (On Line Analytical Processing) inquiry processing method facing database and Hadoop mixing platform
CN102693324A (en) * 2012-01-09 2012-09-26 西安电子科技大学 Distributed database synchronization system, synchronization method and node management method
CN102906751A (en) * 2012-07-25 2013-01-30 华为技术有限公司 Method and device for data storage and data query
CN103577407A (en) * 2012-07-19 2014-02-12 国际商业机器公司 Query method and query device for distributed database
CN104978327A (en) * 2014-04-03 2015-10-14 中国移动通信集团河南有限公司 Data query method, management control node and target data node

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7657581B2 (en) * 2004-07-29 2010-02-02 Archivas, Inc. Metadata management for fixed content distributed data storage


Also Published As

Publication number Publication date
CN106569896A (en) 2017-04-19

Similar Documents

Publication Publication Date Title
CN106569896B (en) A kind of data distribution and method for parallel processing and system
CN111338766B (en) Transaction processing method and device, computer equipment and storage medium
Mohebi et al. Iterative big data clustering algorithms: a review
US7734615B2 (en) Performance data for query optimization of database partitions
Wang et al. Multi-query optimization in mapreduce framework
US9619430B2 (en) Active non-volatile memory post-processing
Bakshi Considerations for big data: Architecture and approach
US9747127B1 (en) Worldwide distributed job and tasks computational model
Olston et al. Automatic optimization of parallel dataflow programs
Băzăr et al. The Transition from RDBMS to NoSQL. A Comparative Analysis of Three Popular Non-Relational Solutions: Cassandra, MongoDB and Couchbase.
US7822710B1 (en) System and method for data collection
Hasani et al. Lambda architecture for real time big data analytic
Sumithra et al. Using distributed apriori association rule and classical apriori mining algorithms for grid based knowledge discovery
CN103399894A (en) Distributed transaction processing method on basis of shared storage pool
US10789087B2 (en) Insight usage across computing nodes running containerized analytics
US10599614B1 (en) Intersection-based dynamic blocking
Bidoit et al. Processing XML queries and updates on map/reduce clusters
WO2015065369A1 (en) Asynchronous garbage collection in a distributed database system
Li Modernization of databases in the cloud era: Building databases that run like Legos
US10552419B2 (en) Method and system for performing an operation using map reduce
US7296013B2 (en) Replacing an unavailable element in a query
US11940972B2 (en) Execution of operations on partitioned tables
Raikhlin et al. Clusterix-Like BigData DBMS
Kim et al. Dynamic partition lock method to reduce transaction abort rates of cloud database
Dai et al. The Hadoop stack: new paradigm for big data storage and processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20180523

Address after: 100070 the 28 tier of fortune Fortune Plaza, No.1, hang Feng Road, Fengtai District, Beijing.

Applicant after: BEIJING GUODIANTONG NETWORK TECHNOLOGY Co.,Ltd.

Address before: 100070 the 28 tier of fortune Fortune Plaza, No.1, hang Feng Road, Fengtai District, Beijing.

Applicant before: BEIJING GUODIANTONG NETWORK TECHNOLOGY Co.,Ltd.

Applicant before: State Grid Corporation of China

Applicant before: STATE GRID ZHEJIANG ELECTRIC POWER Co.

Applicant before: Beijing Zhongdian Feihua Communications Co.,Ltd.

Applicant before: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Applicant before: STATE GRID JIBEI ELECTRIC POWER Co.,Ltd.

Applicant before: STATE GRID EAST INNER MONGOLIA ELECTRIC POWER Co.,Ltd.

Applicant before: BEIJING GREAT OPENSOURCE SOFTWARE Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20181119

Address after: 100070 the 28 tier of fortune Fortune Plaza, No.1, hang Feng Road, Fengtai District, Beijing.

Applicant after: BEIJING GUODIANTONG NETWORK TECHNOLOGY Co.,Ltd.

Applicant after: State Grid Corporation of China

Applicant after: STATE GRID ZHEJIANG ELECTRIC POWER Co.

Applicant after: Beijing Zhongdian Feihua Communications Co.,Ltd.

Applicant after: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Applicant after: STATE GRID JIBEI ELECTRIC POWER Co.,Ltd.

Applicant after: STATE GRID EAST INNER MONGOLIA ELECTRIC POWER Co.,Ltd.

Applicant after: BEIJING GREAT OPENSOURCE SOFTWARE Co.,Ltd.

Address before: 100070 the 28 tier of fortune Fortune Plaza, No.1, hang Feng Road, Fengtai District, Beijing.

Applicant before: BEIJING GUODIANTONG NETWORK TECHNOLOGY Co.,Ltd.

GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100070 Fengtai District, Feng Feng Road, the era of wealth on the 1st floor of the world's 28 floor, Beijing

Co-patentee after: STATE GRID CORPORATION OF CHINA

Patentee after: BEIJING GUODIANTONG NETWORK TECHNOLOGY Co.,Ltd.

Co-patentee after: STATE GRID ZHEJIANG ELECTRIC POWER Co.,Ltd.

Co-patentee after: Beijing Zhongdian Feihua Communications Co.,Ltd.

Co-patentee after: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Co-patentee after: STATE GRID JIBEI ELECTRIC POWER Co.,Ltd.

Co-patentee after: STATE GRID EAST INNER MONGOLIA ELECTRIC POWER Co.,Ltd.

Co-patentee after: BEIJING GREAT OPENSOURCE SOFTWARE Co.,Ltd.

Address before: 100070 Fengtai District, Feng Feng Road, the era of wealth on the 1st floor of the world's 28 floor, Beijing

Co-patentee before: State Grid Corporation of China

Patentee before: BEIJING GUODIANTONG NETWORK TECHNOLOGY Co.,Ltd.

Co-patentee before: STATE GRID ZHEJIANG ELECTRIC POWER Co.

Co-patentee before: Beijing Zhongdian Feihua Communications Co.,Ltd.

Co-patentee before: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Co-patentee before: STATE GRID JIBEI ELECTRIC POWER Co.,Ltd.

Co-patentee before: STATE GRID EAST INNER MONGOLIA ELECTRIC POWER Co.,Ltd.

Co-patentee before: BEIJING GREAT OPENSOURCE SOFTWARE Co.,Ltd.

CP01 Change in the name or title of a patent holder
TR01 Transfer of patent right

Effective date of registration: 20190712

Address after: 100085 Beijing city Haidian District Qinghe small Camp Road No. 15

Co-patentee after: STATE GRID CORPORATION OF CHINA

Patentee after: BEIJING CHINA POWER INFORMATION TECHNOLOGY Co.,Ltd.

Co-patentee after: STATE GRID ZHEJIANG ELECTRIC POWER Co.,Ltd.

Co-patentee after: Beijing Zhongdian Feihua Communications Co.,Ltd.

Co-patentee after: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Co-patentee after: STATE GRID JIBEI ELECTRIC POWER Co.,Ltd.

Co-patentee after: STATE GRID EAST INNER MONGOLIA ELECTRIC POWER Co.,Ltd.

Co-patentee after: BEIJING GREAT OPENSOURCE SOFTWARE Co.,Ltd.

Address before: 100070 Fengtai District, Feng Feng Road, the era of wealth on the 1st floor of the world's 28 floor, Beijing

Co-patentee before: STATE GRID CORPORATION OF CHINA

Patentee before: BEIJING GUODIANTONG NETWORK TECHNOLOGY Co.,Ltd.

Co-patentee before: STATE GRID ZHEJIANG ELECTRIC POWER Co.,Ltd.

Co-patentee before: Beijing Zhongdian Feihua Communications Co.,Ltd.

Co-patentee before: STATE GRID INFORMATION & TELECOMMUNICATION GROUP Co.,Ltd.

Co-patentee before: STATE GRID JIBEI ELECTRIC POWER Co.,Ltd.

Co-patentee before: STATE GRID EAST INNER MONGOLIA ELECTRIC POWER Co.,Ltd.

Co-patentee before: BEIJING GREAT OPENSOURCE SOFTWARE Co.,Ltd.

TR01 Transfer of patent right