CN114691051B - Data processing method and device - Google Patents

Publication number
CN114691051B
Authority
CN
China
Prior art keywords
data
read
shared memory
reading
data reading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210599316.1A
Other languages
Chinese (zh)
Other versions
CN114691051A (en
Inventor
钱山
李森
吴凯
刘芃
张绍震
闫长虎
秦元
杨静
章利君
叶约翰
郑洁锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hundsun Technologies Inc
Original Assignee
Hundsun Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hundsun Technologies Inc
Priority to CN202210599316.1A
Publication of CN114691051A
Application granted
Publication of CN114691051B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/544 Buffers; Shared memory; Pipes

Abstract

An embodiment of the present specification provides a data processing method and apparatus. The data processing method includes: receiving a data reading task, wherein the data reading task carries a to-be-read data identifier; generating and sending a data reading request according to the to-be-read data identifier; receiving a response message for the data reading request, wherein the response message indicates that the data corresponding to the to-be-read data identifier has been written into the shared memory; and reading the data from the shared memory according to the response message. Because the data is read from the shared memory through the exchange of the data reading request and the response message, the speed and efficiency of data reading are improved, which in turn improves communication efficiency.

Description

Data processing method and device
Technical Field
The embodiment of the specification relates to the technical field of data processing, in particular to a data processing method and device.
Background
With the continuous development of the big data era, increasing attention is paid to the efficiency of processing and analyzing data. Massively parallel data processing systems are therefore chosen to analyze and process the data in a database, improving efficiency through parallel computing. Before data can be processed and analyzed, it must be read from a data storage unit; in a massively parallel data processing system, each working node reads from its corresponding data storage unit over network communication.
However, because network communication is constrained by the network protocol, data transmission over the network is often capped by the protocol's upper limit on data transmission, which reduces data communication efficiency.
Disclosure of Invention
In view of this, the embodiments of the present specification provide a data processing method. One or more embodiments of the present specification also relate to a data processing apparatus, a data processing system, a computing device, and a computer-readable storage medium, so as to solve the technical problems in the prior art.
According to a first aspect of the embodiments of the present specification, there is provided a data processing method, comprising:
receiving a data reading task, wherein the data reading task carries a to-be-read data identifier;
generating and sending a data reading request according to the to-be-read data identifier;
receiving a response message for the data reading request, wherein the response message indicates that the data corresponding to the to-be-read data identifier has been written into the shared memory;
and reading the data from the shared memory according to the response message.
Optionally, the receiving a response message for the data reading request includes:
in a case where the data reading task comprises a plurality of data reading events, receiving a response message for the data reading request, wherein the response message indicates that target data corresponding to a target data reading event among the plurality of data reading events has been written into the shared memory.
Optionally, in a case where the response message carries the storage address, in the shared memory, of the data corresponding to the to-be-read data identifier, the reading the data from the shared memory according to the response message includes:
determining, according to the response message, the target data that carries the to-be-read data identifier and corresponds to the target data reading event; and reading the target data from the shared memory according to the storage address in the response message.
Optionally, after the step of receiving a response message for the data reading request, the method further includes: in a case where the response message carries a traversal completion signal, switching the state of the data reading task to a completed state according to the traversal completion signal.
Optionally, before the step of reading the data from the shared memory according to the response message, the method further includes: establishing a data transmission relationship with the shared memory, and then executing the step of reading the data from the shared memory according to the response message.
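The optional features of the first aspect (several data reading events per task, a storage address in each response, and a traversal completion signal that ends the task) can be illustrated with a minimal Python sketch. The names `TaskState`, `ResponseMessage`, and `run_read_task`, the `(offset, length)` form of the storage address, and the use of a `bytearray` as a stand-in for the shared memory are all assumptions made for this illustration; the specification does not prescribe them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional, Tuple


class TaskState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class ResponseMessage:
    # Hypothetical fields: the source only says that a response may carry a
    # storage address and a traversal completion signal.
    data_id: str
    address: Optional[Tuple[int, int]] = None  # assumed (offset, length) pair
    traversal_complete: bool = False


def run_read_task(shared: memoryview, responses: Iterable[ResponseMessage]):
    """Read each announced chunk, then finish on the traversal completion signal."""
    state = TaskState.RUNNING
    chunks: List[bytes] = []
    for message in responses:
        if message.address is not None:
            offset, length = message.address
            # Copy the target data out of the shared buffer at its address.
            chunks.append(bytes(shared[offset:offset + length]))
        if message.traversal_complete:
            state = TaskState.COMPLETED
            break
    return state, chunks


# Stand-in for the shared memory, already filled by the storage side.
shared = memoryview(bytearray(b"row-1row-2"))
state, chunks = run_read_task(shared, [
    ResponseMessage("a", address=(0, 5)),
    ResponseMessage("a", address=(5, 5)),
    ResponseMessage("a", traversal_complete=True),
])
```

In this sketch the task stays in the running state until a response carries the completion signal, which mirrors the state switch described above.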
According to a second aspect of embodiments herein, there is provided an apparatus for data processing, comprising:
the first receiving submodule is configured to receive a data reading task, wherein the data reading task carries a to-be-read data identifier;
the sending submodule is configured to generate and send a data reading request according to the to-be-read data identifier;
the second receiving submodule is configured to receive a response message for the data reading request, wherein the response message is used for indicating that the writing of the data corresponding to the to-be-read data identifier into the shared memory is completed;
and the reading submodule is configured to read data from the shared memory according to the response message.
According to a third aspect of embodiments herein, there is provided a method of data processing, comprising:
receiving a data reading request, wherein the data reading request carries a data identifier to be read;
writing data corresponding to the data identifier to be read into a shared memory according to the data identifier to be read;
and sending a response message for the data reading request, wherein the response message indicates that the data corresponding to the to-be-read data identifier has been written into the shared memory.
Optionally, before the step of writing the data corresponding to the to-be-read data identifier into the shared memory according to the to-be-read data identifier, the method further includes: registering a shared memory identifier, the shared memory identifier pointing to a process address space in a virtual memory; and taking the process address space as the shared memory.
Optionally, after the step of receiving a data read request, the method further includes:
decomposing the data reading request into a plurality of data reading events, wherein each data reading event carries a data identifier to be read; the writing the data corresponding to the identifier of the data to be read into the shared memory according to the identifier of the data to be read includes: determining a target data read event of the plurality of data read events; and writing the target data corresponding to the target data reading event into the shared memory according to the to-be-read data identifier carried in the target data reading event.
Optionally, the step of sending a response message for the data reading request further includes:
sending the response message for the data reading request in a case where the data corresponding to all of the data reading events has been written into the shared memory, wherein the response message carries a traversal completion signal.
Optionally, the writing the data corresponding to the to-be-read data identifier into a shared memory includes: determining the storage position relation of the data corresponding to the data identifier to be read; and writing the data corresponding to the data identifier to be read into the shared memory in rows based on the storage position relation.
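The storage-side steps of the third aspect (decompose the request into data reading events, write the target data into the shared memory row by row, and respond per event, with the last response carrying the traversal completion signal) might look like the following Python sketch. The one-event-per-identifier decomposition, the `store` dictionary, the `Response` fields, and the `bytearray` standing in for the shared memory are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class ReadEvent:
    data_id: str  # to-be-read data identifier carried by the event


@dataclass
class Response:
    data_id: str
    offset: int               # assumed form of the storage address
    length: int
    traversal_complete: bool  # set once every event has been written


def decompose(data_ids: List[str]) -> List[ReadEvent]:
    """Split one read request into events (assumed policy: one per identifier)."""
    return [ReadEvent(data_id) for data_id in data_ids]


def serve_request(data_ids: List[str],
                  store: Dict[str, List[bytes]],
                  shared: bytearray) -> Iterator[Response]:
    """Write each event's rows into the shared buffer, then announce them."""
    events = decompose(data_ids)
    cursor = 0
    for index, event in enumerate(events):
        start = cursor
        for row in store[event.data_id]:       # rows in their stored order
            end = cursor + len(row)
            if end > len(shared):
                raise MemoryError("shared buffer too small")
            shared[cursor:end] = row           # row-by-row write
            cursor = end
        is_last = index == len(events) - 1
        # The response is produced only after the event's data is written.
        yield Response(event.data_id, start, cursor - start, is_last)


store = {"a": [b"a1", b"a2"], "b": [b"b1"]}    # hypothetical stored rows
shared = bytearray(16)
responses = list(serve_request(["a", "b"], store, shared))
```

Yielding the response after the write finishes is the point of the design: the reader never sees an address for data that is still being written.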
According to a fourth aspect of embodiments herein, there is provided an apparatus for data processing, comprising:
the receiving submodule is configured to receive a data reading request, wherein the data reading request carries a to-be-read data identifier;
the writing submodule is configured to write data corresponding to the to-be-read data identifier into a shared memory according to the to-be-read data identifier;
and the sending submodule is configured to send a response message for the data reading request, wherein the response message is used for indicating that the data corresponding to the to-be-read data identifier is completely written into the shared memory.
According to a fifth aspect of embodiments herein, there is provided a data processing system comprising:
the data reading module and the data storage module;
the data reading module is used for receiving a data reading task carrying a data identifier to be read, generating a data reading request according to the data identifier to be read, and sending the data reading request to the data storage module;
the data storage module is configured to receive the data reading request carrying the to-be-read data identifier, to write, in response to the data reading request, the data corresponding to the to-be-read data identifier into a shared memory, and to send a response message for the data reading request to the data reading module, where the response message indicates that the data corresponding to the to-be-read data identifier has been written into the shared memory;
the data reading module is further configured to receive the response message, and read data from the shared memory according to the response message.
Optionally, the data reading module and the data storage module are communicatively connected through a remote procedure call protocol.
Optionally, the system further comprises:
the coordination module is used for acquiring a data reading request submitted by a client and analyzing the data reading request to determine an analysis result;
the execution plan module is used for determining a data reading task according to the analysis result;
and the scheduling task module is used for distributing the data reading task to the data reading module, wherein the data reading task carries the identification of the data to be read.
According to a sixth aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is for storing computer-executable instructions and the processor is for executing the computer-executable instructions, which when executed by the processor, implement the steps of the above-described method.
According to a seventh aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described method.
One embodiment of the present specification provides a data processing method. A data reading task is received, where the data reading task carries a to-be-read data identifier; because the identifier identifies the data to be read, that data can be determined quickly, which improves data reading efficiency. A data reading request is then generated and sent according to the to-be-read data identifier, and a response message for the data reading request is received, where the response message indicates that the data corresponding to the to-be-read data identifier has been written into the shared memory. Through this exchange of the data reading request and the response message, the data reading module can read the data from the shared memory quickly without searching the shared memory for the data storage address, which reduces data reading time and improves data reading efficiency. Finally, the data is read from the shared memory according to the response message, completing the data reading module's access to the shared memory. The data reading module therefore does not need to exchange data with the data storage module through network communication, so the limitations that the network protocol imposes on network communication are avoided and communication efficiency is improved.
Drawings
FIG. 1 is a flowchart of a data processing method provided by an embodiment of the present specification;
FIG. 2 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present specification;
FIG. 3 is a flowchart of a data processing method provided by another embodiment of the present specification;
FIG. 4 is a schematic structural diagram of a data processing apparatus provided by another embodiment of the present specification;
FIG. 5 is a block diagram of a data processing system provided by an embodiment of the present specification;
FIG. 6 is a flowchart of a data read + write method provided by an embodiment of the present specification;
FIG. 7 is a schematic structural diagram illustrating a data read + write method provided by an embodiment of the present specification;
FIG. 8 is a block diagram of a computing device provided by an embodiment of the present specification.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present specification. This description may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein, as those skilled in the art will be able to make and use the present disclosure without departing from the spirit and scope of the present disclosure.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present specification refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used in one or more embodiments of this specification to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, a first may also be referred to as a second and, similarly, a second may also be referred to as a first. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."
First, the terms used in one or more embodiments of the present specification are explained.
OLAP (On-Line Analytical Processing): an online analytical processing technology and the main application form of a data warehouse system. It helps analysts examine data from multiple angles and mine its value, and is mainly used to support complex analysis operations, with an emphasis on decision support and intuitive, understandable query results.
MPP (Massively Parallel Processing): a massively parallel processing system that distributes tasks across multiple servers and nodes in parallel; after the computation completes on each node, the partial results are gathered to obtain the final result.
Kudu: can serve as the data storage unit of the MPP engine, for storing data.
RPC (Remote Procedure Call): a remote procedure call protocol.
Process address space: the virtual memory addressable by a process. The kernel allows the process to use addresses within this virtual memory; it defines a memory region that constitutes the range in which the current process operates.
SQL (Structured Query Language): a structured query language used for accessing data and for querying, updating, and managing relational database systems.
With the continuous development of the big data era, increasing attention is paid to the efficiency of processing and analyzing data. Massively parallel data processing systems are therefore chosen to analyze and process the data in a database, improving efficiency through parallel computing. Before data can be processed and analyzed, it must be read from a data storage unit; in a massively parallel data processing system, each working node reads from its corresponding data storage unit over network communication.
However, because network communication is constrained by the network protocol, data transmission over the network is often capped by the protocol's upper limit on data transmission, which reduces data communication efficiency.
In practical applications, big data processing and analysis generally adopts a distributed architecture, which improves processing efficiency through parallel computing and also provides scalability. Some storage engines mainly store static data and suit high-throughput offline big data analysis, but they cannot read and write data randomly or modify data. Other storage engines mainly store dynamic data and suit random reads and writes over massive data, but their batch read performance is poor and they are unsuitable for offline analysis of batch data.
Given these limitations of storage engines, Kudu, serving as the data storage unit of the MPP engine, provides simple data processing capabilities, such as inserting, updating, and deleting data, as well as simple data analysis capabilities. That is, Kudu is deployed on each working node of the MPP engine as the data storage unit, and the MPP working nodes read data from the Kudu data storage unit over network communication; this avoids the above limitations, improves read performance, and extends the data analysis functions.
Moreover, because a Kudu data storage unit is deployed on each MPP working node, the network transmission bandwidth limit of the physical network card is avoided, and a domain socket can be used in place of a network socket, which reduces the number of network protocol layers and improves communication efficiency.
However, network communication between the MPP working node and the Kudu data storage unit is still constrained by the network protocol, so the upper limit on data transmission remains. In addition, when the MPP working node reads data directly from the Kudu data storage unit over network communication, the data is serialized into a byte stream in the Kudu data storage unit, transmitted to the MPP working node, and deserialized back into data objects there; the serialization and deserialization consume considerable CPU computing resources.
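The two data paths contrasted above can be made concrete with a small Python sketch. It only illustrates the shape of each path and measures nothing. The `pickle` round trip stands in for the serialize, transmit, deserialize sequence of network communication; the fixed-width record layout `RECORD` is a hypothetical format chosen for this example, not one described in the specification.

```python
import pickle
import struct

rows = [(1, 2.5), (2, 7.25)]

# Network-style path: build a byte stream, "transmit" it, rebuild the objects.
stream = pickle.dumps(rows)
rows_via_stream = pickle.loads(stream)

# Shared-memory-style path: the writer places fixed-width records in a buffer
# and the reader decodes fields in place, with no intermediate stream to copy.
RECORD = struct.Struct("<qd")  # hypothetical layout: int64 id, float64 value
buffer = bytearray(RECORD.size * len(rows))
for index, row in enumerate(rows):
    RECORD.pack_into(buffer, index * RECORD.size, *row)

rows_in_place = [
    RECORD.unpack_from(buffer, index * RECORD.size)
    for index in range(len(rows))
]
```

Both paths recover the same rows; the difference is that the second reads directly from a buffer both sides can address, which is the property the shared memory approach relies on.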
Based on this, an embodiment of the present specification provides a data processing method, which may be understood as a data reading method. Through the exchange of a data reading request and a response message, data can be read quickly from a shared memory, which improves data reading efficiency, avoids network communication for the data transfer, and thereby avoids serialization and deserialization and the considerable CPU computing resources they consume.
Furthermore, the embodiments of the present specification are described in detail using a scenario in which an MPP node reads data as an example. The MPP node serves as the data reading module of a data processing engine: it sends a data reading request, receives a response message for the data reading request, and reads data from the shared memory according to the response message.
The present specification provides a data processing method, and also relates to a data processing apparatus, another data processing method, another data processing apparatus, a data processing system, a computing device, and a computer-readable storage medium, which are described in detail one by one in the following embodiments.
Referring to fig. 1, fig. 1 shows a flowchart of a data processing method according to an embodiment of the present specification, which specifically includes the following steps.
Step 102, receiving a data reading task, wherein the data reading task carries a data identifier to be read.
Specifically, the data reading task may be a work task issued by the data processing engine, or may also be a work task issued by the client, where the data reading task includes data information to be read. The identifier of the data to be read may be understood as identification information corresponding to the data to be read, and may be used to identify the data to be read.
It can be understood that the data processing method provided in this embodiment is applied to the data reading module, and the data reading module can access the shared memory and read the required data from the shared memory, so as to improve the efficiency of data reading and avoid the limitation of a network protocol in network communication.
Based on this, in the embodiment of the present specification, the data reading module receives a data reading task issued by the data processing engine, where the data reading task carries a to-be-read data identifier for identifying data to be read, and according to the to-be-read data identifier, the data to be read in the data reading task can be determined. For example, the data identifier to be read may indicate a storage location of the data and may also indicate a category of the data.
Taking the MPP data processing engine as an example, the MPP data processing engine issues a data reading task to the MPP working node, and the MPP working node receives the data reading task, where the data reading task carries the data identifier a to be read, and according to the data identifier a to be read, the data a to be read in the data reading task can be determined.
In summary, the data to be read carried in the data reading task can be identified quickly, so as to provide favorable conditions for writing and reading data.
Step 104, generating and sending a data reading request according to the to-be-read data identifier.
It should be noted that, on the basis of receiving the data reading task carrying the identification of the data to be read, in order to enable the data storage module to know the data to be read in the data reading task, a data reading request corresponding to the data reading task may be generated according to the identification of the data to be read, and sent to the data storage module, so that the data storage module writes the data to be read into the shared memory, and the data reading module can subsequently read the data from the shared memory.
Specifically, the data reading request carries the to-be-read data identifier, so that the receiving end can determine the data to be read from the identifier carried in the request. The receiving end may be understood as a data storage module and, further, may be a local data storage module.
Based on this, in the embodiments of the present specification, a data reading request is generated according to a to-be-read data identifier carried in a data reading task, and the data reading request is sent to a receiving end.
Continuing the above example, the data reading request is generated according to the to-be-read data identifier a carried in the data reading task; the data reading request therefore also contains the to-be-read data identifier a. The data reading request is sent to the Kudu data storage module, so that the Kudu data storage module can determine the data a to be read according to the to-be-read data identifier a.
In summary, by sending the data reading request to the receiving end, based on the carried identifier of the data to be read, the receiving end can quickly determine the data requested to be read in the data reading request, thereby providing favorable conditions for writing and reading the data.
Step 106, receiving a response message for the data reading request, where the response message is used to indicate that the writing of the data corresponding to the to-be-read data identifier into the shared memory is completed.
It should be noted that, on the basis of the above-mentioned generating and sending a data read request according to the identifier of the data to be read, the receiving end receives the data read request, and after writing the data requested to be read in the data read request into the shared memory, sends a response message for the data read request to the data read module.
Specifically, the response message is sent by the receiving end of the data reading request in reply to that request, and indicates that the data corresponding to the to-be-read data identifier carried in the data reading request has been written into the shared memory.
It will be appreciated that the address of the shared memory is fixed: it is a fixed and unique process address space registered under the shared memory identifier. The address information of the shared memory can therefore be configured in the data reading module before the data reading process starts and does not need to be configured again for each data transfer, which improves data transmission efficiency.
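As a rough analogy for registering a shared memory under an identifier and configuring the reader once, Python's standard `multiprocessing.shared_memory` module lets one side create a named segment and another side attach to it by name. Treating the segment name as the shared memory identifier is an assumption of this sketch; the specification does not name a particular mechanism. Both handles live in one process here only to keep the example self-contained.

```python
from multiprocessing import shared_memory

# Storage side: register a named segment once. The name plays the role of
# the shared memory identifier (an assumption made for this sketch).
segment = shared_memory.SharedMemory(create=True, size=64)
segment_name = segment.name  # configured once in the reader, before any read

# Reader side: attach by name; nothing is reconfigured per transfer.
reader_view = shared_memory.SharedMemory(name=segment_name)

# A write through one handle is visible through the other.
segment.buf[:5] = b"hello"
seen_by_reader = bytes(reader_view.buf[:5])

# Release the handles, then remove the segment when the storage side is done.
reader_view.close()
segment.close()
segment.unlink()
```

In a real deployment the two handles would belong to different processes on the same machine, with only the name passed between them ahead of time.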
In addition, the response message may also carry a storage address of the data in the shared memory, and the data reading module reads the data from the shared memory according to the storage address, so that the storage position of the data requested to be read in the shared memory can be quickly found, the data reading time is reduced, and the data reading efficiency is further improved. The storage address may be understood as a storage location of data corresponding to the identifier of the data to be read in the shared memory, and the data corresponding to the identifier of the data to be read may be read at the storage address, for example, the storage addresses of the data corresponding to the identifier of the data to be read carried in the response message in the shared memory are in columns 1 to 200, and the data corresponding to the identifier of the data to be read may be read in columns 1 to 200 in the shared memory.
Based on this, in the embodiments of the present specification, the data reading module receives a response message for the data reading request, where the response message is used to indicate data corresponding to the identifier of the data to be read, that is, the data requested to be read in the data reading request is written into the shared memory completely, the data reading module can know based on the response message that the data requested to be read has been written into the shared memory completely, and the data reading module can read the data from the shared memory; and the response message also carries the storage address of the data corresponding to the to-be-read data identifier in the shared memory, and the data reading module can find and read the data requested to be read at the storage address.
Along the above example, the MPP working node receives a response message for the data a carrying the identifier a of the data to be read, and the response message indicates that the writing of the data a corresponding to the identifier a of the data to be read into the shared memory is completed; and the response message carries the first row and the first column of the storage address of the data a corresponding to the data identifier a to be read in the shared memory, and according to the response message, the MPP working node can find the data a in the first row and the first column of the shared memory and read the data a.
In summary, through the exchange of the data reading request and the response message between the data reading module and the data storage module, the data reading module sends the data reading request to the data storage module and then reads the data from the shared memory based on the response message returned by the data storage module. That is, after the data reading module sends the data reading request, the data storage module that receives it writes the requested data into the shared memory, so that the data reading module can read the data from the shared memory based on the response message. This interaction quickly determines the data requested to be read and prepares the data reading module to read it from the shared memory, which further improves data reading efficiency. Because the data is not transferred through network communication, the transfer is not constrained by a network protocol and is therefore not subject to an upper limit on data transmission.
Step 108, reading data from the shared memory according to the response message.
It should be noted that, once the response message has been received, the reading step can be executed according to its indication that the data has been written into the shared memory.
Specifically, the response message may be used to indicate that the data requested to be read in the data read request has been written into the shared memory.
Based on this, upon receiving the response message, the data reading module can determine that the requested data has been written into the shared memory and can then read the data directly from the shared memory.
In addition, under the condition that the response message carries the storage address of the data corresponding to the identification of the data to be read in the shared memory, the data reading module can find the data requested to be read at the storage address and read the data.
Continuing the above example, the MPP worker node reads data a from the first row and first column of the shared memory, according to the response message carrying that storage address.
In summary, based on the storage address carried in the response message, the data reading module can quickly locate the requested data in the shared memory. This spares the data reading module from traversing the shared memory to find the requested data, which reduces data reading time and improves data reading efficiency.
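As a non-limiting illustration, the read in step 108 can be sketched in Python with the standard `multiprocessing.shared_memory` module. The `ResponseMessage` class, its field names, and the modelling of the storage address as a byte offset plus a length are assumptions made for this sketch; the specification itself describes the address only in terms of rows and columns.

```python
from dataclasses import dataclass
from multiprocessing import shared_memory


@dataclass
class ResponseMessage:
    """Hypothetical response message; the field names are illustrative."""
    data_id: str   # to-be-read data identifier
    offset: int    # storage address, modelled here as a byte offset
    length: int    # number of bytes written at that offset


def read_by_address(shm: shared_memory.SharedMemory, msg: ResponseMessage) -> bytes:
    # Copy exactly the bytes named by the response message; no scan is needed.
    return bytes(shm.buf[msg.offset:msg.offset + msg.length])


# Stand-in for the data storage module: write data "a" and build the response.
shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    payload = b"data a"
    shm.buf[16:16 + len(payload)] = payload
    response = ResponseMessage(data_id="a", offset=16, length=len(payload))

    # Reading side: go directly to the address carried in the response.
    result = read_by_address(shm, response)
finally:
    shm.close()
    shm.unlink()
```

Because the address arrives with the response message, the reading side performs a single slice of the segment rather than a search over it.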
Further, in practical applications, a large amount of data usually needs to be read for data management, processing and analysis. If all of the data requested in a data reading request were written into the shared memory in a single pass and then read from it in a single pass, reading and writing would become more difficult and less efficient. The data can therefore be read in batches, and the specific implementation is as follows.
In the case where the data reading task includes a plurality of data reading events, a response message for the data reading request is received, where the response message indicates that target data corresponding to a target data reading event among the plurality of data reading events has been written into the shared memory.
Specifically, the data reading task may be a task that needs to read a large amount of data. Such a task is divided into a plurality of data reading events, that is, the data is split into batches, and each data reading event covers one part of the data. The plurality of data reading events includes a target data reading event, which may be understood as the event whose data is currently being read.
Based on this, in the embodiments of the present specification, a response message for a data read request is received, where the response message is used to indicate that target data in a target data read event has been written into the shared memory, so as to read the target data from the shared memory.
Correspondingly, the target data that carries the to-be-read data identifier and corresponds to the target data reading event is determined according to the response message, and the target data is read from the shared memory according to the storage address in the response message.
Specifically, under the condition that the response message carries the storage address of the data corresponding to the identification of the data to be read in the shared memory, the target data is read from the shared memory according to the storage address of the target data carried in the response message and the identification of the data to be read, so as to complete the target data reading event.
Then, a second target data read event may be determined among the multiple data read events. When the received response message indicates that the second target data of that event has been written into the shared memory, the second target data is read from the shared memory according to the above steps. These steps are repeated until the data of all data read events in the data read request has been read.
It can be understood that the data reading module sends a data reading request that asks for a large amount of data. The receiving end splits the requested data into batches, one per target data reading event, and sends a response message for each target data reading event once its target data has been written into the shared memory. Based on these response messages, the data reading module reads the data from the shared memory in batches.
In the embodiments of the present specification, for a data reading task that involves a large amount of data, the data reading module may alternatively split the data into batches itself, assign each batch to its own data reading request, and send these requests separately, thereby also realizing batch data reading.
For example, the data reading task is to read 9500 pieces of data. Based on the task, the data reading module sends a data reading request for 9500 pieces of data and receives a response message for that request. The response message carries a signal that the 1000 pieces of data corresponding to the first target data reading event have been written into the shared memory, together with the storage location of those 1000 pieces of data in the shared memory, and the data reading module reads the 1000 pieces of data from the shared memory according to that storage location. The data reading module then receives a signal that the 1000 pieces of data corresponding to the second target data reading event have been written into the shared memory, and it repeats these steps until all 9500 pieces of data have been read.
In summary, the response messages for each target data reading event are received respectively, and the target data corresponding to the target data reading event is read based on the response messages until all the data requested to be read in the data reading request is read, that is, a large amount of data requested to be read in the data reading request is divided into a plurality of target data reading events, so that batch reading of the large amount of data is realized, the difficulty in reading the data is reduced, and the efficiency of reading the data is improved.
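The batch reading described above can be sketched as follows. This is a single-process simulation under stated assumptions: a Python list stands in for the shared memory, a generator named `storage_side` stands in for the data storage module, and each yielded `(start, count)` pair plays the role of the storage address carried in a response message. None of these names come from the specification.

```python
from typing import Iterator, List, Tuple

BATCH = 1000  # pieces of data per data reading event, as in the example


def storage_side(total: int, shared: List[int]) -> Iterator[Tuple[int, int]]:
    """Stand-in for the data storage module.

    Writes one batch into `shared`, then yields (start, count) as the
    hypothetical storage address of that batch.
    """
    for first in range(0, total, BATCH):
        count = min(BATCH, total - first)
        shared[0:count] = range(first, first + count)  # reuse the same region
        yield 0, count


def read_all(total: int) -> List[int]:
    shared = [0] * BATCH  # plays the role of the shared memory
    received: List[int] = []
    # Each iteration corresponds to one response message and one batch read.
    for start, count in storage_side(total, shared):
        received.extend(shared[start:start + count])
    return received


records = read_all(9500)
```

With 9500 pieces of data and a batch size of 1000, the loop runs ten times, and the final iteration reads only the remaining 500 pieces.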
Further, when the data reading task involves a large amount of data, the data is read in batches to improve data reading efficiency. For the last read, however, in order to improve reading efficiency and reduce read errors and read failures in the data reading module, a traversal completion signal carried in the response message may be used to indicate to the data reading module that all of the data requested in the data reading request has been written into the shared memory in batches. Based on the traversal completion signal, the data reading module switches the state of the data reading task to the completion state, that is, no further data reading is performed after the current read finishes. The specific implementation is as follows.
In the case where the response message carries a traversal completion signal, the state of the data reading task is switched to a completion state according to the traversal completion signal.
Specifically, the traversal completion signal indicates that all data requested in the data reading request has been written into the shared memory and that the current read is the last one. After this read finishes, the state of the data reading task is switched to the completion state to mark the task as completed, and no further data reading is performed.
Based on this, in the embodiment of the present invention, when the response message received by the data reading module carries a traversal completion signal that the data requested to be read in the data reading request has been completely written into the shared memory, the data reading module reads the data from the shared memory for the last time based on the traversal completion signal, and then switches the state of the data reading task to the completion state, which indicates that the data reading task is completed.
Continuing the above example, in the last target data reading event only 500 pieces of target data are written into the shared memory. The response message then carries a traversal completion signal indicating that all 9500 pieces of data have been written into the shared memory. Based on the traversal completion signal, the data reading module reads the 500 pieces of data from the shared memory as its last read and then switches the state of the data reading task to the completion state, which indicates that the task of reading 9500 pieces of data is completed.
In summary, when the response message carries the traversal completion signal, the state of the data reading task is switched to the completion state to end the data reading task, so as to prevent the data reading module from having a reading failure or a reading error, reduce the reading error rate, and improve the reading efficiency.
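One possible way to model the state switch on the reading side is a small state machine, sketched below. The `TaskState`, `Response` and `ReadTask` names, and the choice to raise an error when a response arrives after completion, are assumptions of this sketch rather than features stated in the specification.

```python
from dataclasses import dataclass
from enum import Enum


class TaskState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class Response:
    """Hypothetical response message for one data reading event."""
    count: int                    # pieces of data written for this event
    traversal_done: bool = False  # traversal completion signal


class ReadTask:
    def __init__(self) -> None:
        self.state = TaskState.RUNNING
        self.records_read = 0

    def on_response(self, resp: Response) -> None:
        if self.state is TaskState.COMPLETED:
            raise RuntimeError("task already completed; no further reads")
        # Stands in for reading `resp.count` pieces from the shared memory.
        self.records_read += resp.count
        if resp.traversal_done:
            # Last read finished: switch the task to the completion state.
            self.state = TaskState.COMPLETED


task = ReadTask()
for _ in range(9):
    task.on_response(Response(count=1000))
task.on_response(Response(count=500, traversal_done=True))
```

After the tenth response, the task has read 9500 pieces of data and rejects any further response, which mirrors the requirement that no data reading is performed once the task is complete.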
Further, in order to enable the data reading module to directly read data from the shared memory, a data transmission relationship needs to be established with the shared memory to provide conditions for reading data, and the specific implementation manner is as follows.
A data transmission relationship is established with the shared memory, and then the step of reading data from the shared memory according to the response message is executed.
Specifically, the data transmission relationship may be understood as data communication between the data reading module and the shared memory.
Based on this, in the embodiments of the present specification, the data reading module connects to the shared memory. Specifically, the data reading module may attach to the shared memory by means of the shared memory identifier. Based on the shared memory identifier, the data reading module can access the shared memory, thereby establishing a data transmission relationship with it, and then execute the step of reading data from the shared memory according to the response message.
For example, the shared memory identifier is zzx, and the data reading module is connected to the shared memory zzx based on the shared memory identifier zzx, and further reads data from the shared memory zzx.
It is understood that the shared memory identifier may be understood as a registered unique identifier, which corresponds to a unique fixed shared memory space. The shared memory identifier may be transmitted from the data storage module to the data reading module in the communication process between the data reading module and the data storage module, or may be preset in the data reading module after the user obtains the shared memory identifier.
In summary, by establishing the data transmission relationship with the shared memory, conditions are provided for the data reading module to access the shared memory and read data from the shared memory.
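Attaching to a shared memory by its identifier can be illustrated with Python's `multiprocessing.shared_memory`, where the segment name plays the role of the shared memory identifier. This sketch runs both sides in one process for brevity and lets the operating system generate the name instead of using the identifier zzx from the example, to avoid name collisions; both choices are assumptions of the sketch.

```python
from multiprocessing import shared_memory

# Storage side (stand-in): create a segment; its name acts as the identifier.
owner = shared_memory.SharedMemory(create=True, size=64)
shm_id = owner.name  # auto-generated here; the example in the text uses "zzx"
owner.buf[:5] = b"hello"

# Reading side: attach to the existing segment using only the identifier.
reader = shared_memory.SharedMemory(name=shm_id)
seen = bytes(reader.buf[:5])

# Detach both handles, then remove the segment.
reader.close()
owner.close()
owner.unlink()
```

In a real deployment the identifier would be passed to the reading side, for instance in a message from the data storage module or as a preset configuration value, as the preceding paragraphs describe.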
To sum up, an embodiment of the present specification provides a data processing method, which receives a data reading task, where the data reading task carries a to-be-read data identifier, and the to-be-read data identifier may be used to identify data to be read, so as to quickly determine the data to be read, and further improve data reading efficiency; then, generating and sending a data reading request according to the to-be-read data identifier; receiving a response message aiming at the data reading request, wherein the response message is used for indicating that the writing of the data corresponding to the to-be-read data identifier into the shared memory is completed; through interaction between the data reading request and the response message, the data reading module can quickly read data from the shared memory, so that the data reading time is reduced, and the data reading efficiency is improved; and then, reading data from the shared memory according to the response message, and further completing the process of accessing the shared memory by the data reading module to read the data, so that the data reading module does not need to transmit the data with the data storage module through network communication, the limitation of a network protocol during the network communication is avoided, and further the communication efficiency is improved.
Corresponding to the foregoing data processing method embodiment, this specification further provides a data processing apparatus embodiment, and fig. 2 shows a schematic structural diagram of a data processing apparatus provided in an embodiment of this specification. As shown in fig. 2, the apparatus includes: the first receiving submodule 202 is configured to receive a data reading task, where the data reading task carries an identifier of data to be read; the sending submodule 204 is configured to generate and send a data reading request according to the to-be-read data identifier; the second receiving submodule 206 is configured to receive a response message for the data reading request, where the response message is used to indicate that writing of data corresponding to the to-be-read data identifier into the shared memory is completed; and a reading sub-module 208 configured to read data from the shared memory according to the response message.
In an optional embodiment, the second receiving submodule 206 is further configured to: and receiving a response message aiming at the data reading request under the condition that the data reading task comprises a plurality of data reading events, wherein the response message is used for indicating that the writing of target data corresponding to a target data reading event in the plurality of data reading events into the shared memory is completed.
In an optional embodiment, the reading submodule 208 is further configured to: under the condition that the response message carries the storage address of the data corresponding to the identification of the data to be read in the shared memory, determining target data which carries the identification of the data to be read and corresponds to the target data reading event according to the response message; and reading the target data from the shared memory according to the storage address in the response message.
In an optional embodiment, the data processing apparatus further comprises a switching sub-module configured to: and under the condition that the response message carries a traversal completion signal, switching the state of the data reading task to a completion state according to the traversal completion signal.
In an optional embodiment, the data processing apparatus further comprises a connection submodule configured to: and establishing a data transmission relation with the shared memory, and executing the step of reading data from the shared memory according to the response message.
To sum up, in the data processing apparatus provided in an embodiment of the present specification, the first receiving sub-module 202 is configured to receive a data reading task, where the data reading task carries a to-be-read data identifier, and the to-be-read data identifier may be used to identify data to be read, so as to quickly determine the data to be read, and further improve data reading efficiency; the sending submodule 204 is configured to generate and send a data reading request according to the to-be-read data identifier; the second receiving submodule 206 is configured to receive a response message for the data reading request, where the response message is used to indicate that writing of data corresponding to the to-be-read data identifier into the shared memory is completed; through interaction between the data reading request and the response message, the data reading module can quickly read data from the shared memory, so that the data reading time is reduced, and the data reading efficiency is improved; the reading sub-module 208 is configured to read data from the shared memory according to the response message, and then complete a process of the data reading module accessing the shared memory to read the data, so that the data reading module does not need to perform data transmission with the data storage module through network communication, limitation of a network protocol during the network communication is avoided, and further communication efficiency is improved.
The above is a schematic configuration of a data processing apparatus of the present embodiment. It should be noted that the technical solution of the data processing apparatus and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the data processing apparatus can be referred to the description of the technical solution of the data processing method.
Corresponding to the data processing method for the data reading module, an embodiment of the present specification further provides a data processing method, where the data processing method is used for a data storage module, and the data storage module can write data requested to be read in a data reading request into a shared memory and send a response message indicating that the writing is completed to the data reading module, so that the data reading module can quickly and timely read the data from the shared memory, and fig. 3 shows a flowchart of a data processing method provided according to an embodiment of the present specification, and specifically includes the following steps.
Step 302, receiving a data reading request, where the data reading request carries an identifier of data to be read.
Specifically, the sending end of the data reading request may be a data reading module, and the data requested to be read in the data reading request may be determined according to the identifier of the data to be read.
Based on this, in the embodiments of the present specification, the execution subject of the data processing method may be a data storage module, for example, the data storage module may be a local Kudu data storage module. The data storage module receives a data reading request and can determine the data requested to be read according to the data identification to be read.
For example, the Kudu data storage module receives a data read request, where the data read request carries a data identifier a to be read, and the data requested to be read in the data read request may be determined to be the data a according to the data identifier a to be read.
In summary, the data requested to be read in the data reading request can be quickly determined through the data identifier to be read carried in the data reading request, and the data writing efficiency is improved.
Further, in order to enable the data storage module to write data into a space and enable the data reading module to read data from the space, a shared space, that is, a shared memory needs to be created to implement access of the data storage module and the data reading module, which is implemented as follows.
Registering a shared memory identifier, the shared memory identifier pointing to a process address space in a virtual memory; and taking the process address space as the shared memory.
Specifically, the shared memory identifier is used to represent a unique fixed process address space, which may be used as a shared memory.
Based on this, before writing the data corresponding to the data identifier to be read into the shared memory, the shared memory identifier is registered, and the process address space pointed by the shared memory identifier is used as the shared memory to store the data.
For example, the shared memory identifier obtained by registration is zzx, and the process address space zzx pointed by the shared memory identifier zzx is a shared memory.
In summary, the shared memory is created by registering the shared memory identifier, so that the data storage module and the data reading module can directly access the shared memory, thereby avoiding serialization and deserialization operations among processes, and saving a large amount of computing resources.
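To illustrate how direct access can avoid serialization, the sketch below writes 64-bit integers into a shared memory segment through a typed view and reads them back through a second handle, without converting them to or from a wire format. The use of `memoryview.cast("q")` and the sample values are assumptions chosen for this sketch, and both handles live in one process only to keep the example self-contained.

```python
from multiprocessing import shared_memory

values = [3, 1, 4, 1, 5]

# Storage side: register the segment and write 64-bit integers in place.
writer = shared_memory.SharedMemory(create=True, size=len(values) * 8)
w_view = writer.buf.cast("q")  # typed view over the raw bytes
for i, v in enumerate(values):
    w_view[i] = v
w_view.release()               # release the view before closing the handle

# Reading side: attach by identifier and read the same integers directly.
reader = shared_memory.SharedMemory(name=writer.name)
r_view = reader.buf.cast("q")
read_back = [r_view[i] for i in range(len(values))]
r_view.release()

reader.close()
writer.close()
writer.unlink()
```

Both sides only need to agree on the element type and position; no encode or decode step sits between the write and the read.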
Step 304, writing the data corresponding to the data identifier to be read into a shared memory according to the data identifier to be read.
It should be noted that, on the basis of receiving the data reading request carrying the identifier of the data to be read, in order to obtain the data requested to be read in the data reading request, the data corresponding to the identifier of the data to be read may be determined according to the identifier of the data to be read, and the data corresponding to the identifier of the data to be read is written into the shared memory, so as to facilitate the reading operation of the data reading module.
Based on this, the data storage module determines the data requested in the data reading request, that is, the data corresponding to the to-be-read data identifier, according to that identifier, retrieves the data from its own storage, and writes the data into the shared memory.
Continuing the above example, the Kudu data storage module determines from the to-be-read data identifier a that the requested data is data a, retrieves data a from its own storage, and writes data a into the shared memory.
In summary, the data storage module determines the data corresponding to the data identifier to be read according to the data identifier, finds the data from the data stored in the data storage module and transmits the data to the shared memory, so as to complete the writing of the data, and provide favorable conditions for the data reading module to read the data from the shared memory.
Step 306, sending a response message for the data reading request, where the response message is used to indicate that the writing of the data corresponding to the to-be-read data identifier into the shared memory is completed.
It should be noted that, after the data corresponding to the to-be-read data identifier is written into the shared memory, a response message for the data reading request may be sent to the data reading module so that it can quickly determine that the data has been written. The response message indicates to the data reading module that the data corresponding to the to-be-read data identifier has been written into the shared memory, which facilitates the reading operation of the data reading module.
In addition, the response message may also carry the storage address, in the shared memory, of the data corresponding to the to-be-read data identifier. This indicates the storage location of the data to the data reading module, so that the data reading module can locate the data quickly, thereby improving reading efficiency.
Based on this, after the data corresponding to the to-be-read data identifier has been written into the shared memory, a response message is generated and sent to the data reading module to indicate that the data it requested has been completely written into the shared memory. The response message carries the storage address of that data in the shared memory, and the data reading module can read the data at that storage address.
Continuing the above example, the Kudu data storage module sends a response message to the data reading module. The response message indicates that data a has been written into the shared memory, and it also carries the storage address of data a in the shared memory, namely the first row and first column.
In summary, the data reading module can read data a directly from the first row and first column of the shared memory without having to search the shared memory for it, which saves data reading time and improves data reading efficiency.
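Steps 302 to 306 on the data storage module side can be sketched as a single handler. The `STORE` dictionary, the `handle_read_request` function, the dictionary-shaped response and the byte-offset address are all hypothetical stand-ins introduced for this sketch; an actual data storage module such as Kudu would look the data up in its own storage.

```python
from multiprocessing import shared_memory

# Stand-in for the data held by the data storage module (hypothetical contents).
STORE = {"a": b"contents of data a"}


def handle_read_request(data_id: str, shm: shared_memory.SharedMemory) -> dict:
    """Look up the requested data, write it, and build the response message."""
    data = STORE[data_id]                      # step 302: determine the data
    offset = 0                                 # illustrative storage address
    shm.buf[offset:offset + len(data)] = data  # step 304: write to shared memory
    return {                                   # step 306: response message
        "data_id": data_id,
        "written": True,
        "offset": offset,
        "length": len(data),
    }


shm = shared_memory.SharedMemory(create=True, size=256)
try:
    response = handle_read_request("a", shm)
    # The reading side can go straight to the address named in the response.
    start = response["offset"]
    end = start + response["length"]
    data_read = bytes(shm.buf[start:end])
finally:
    shm.close()
    shm.unlink()
```

The response is built only after the write has finished, so receiving it is sufficient evidence for the reading side that the data is in place.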
Further, in practical applications, a large amount of data usually needs to be read for data management, processing and analysis. If all of the data requested in a data reading request were written into the shared memory in a single pass and then read from it in a single pass, reading and writing would become more difficult and less efficient. The data reading request can therefore be decomposed, and the specific implementation is as follows.
Decomposing the data reading request into a plurality of data reading events, wherein each data reading event carries a data identifier to be read; the writing the data corresponding to the to-be-read data identifier into a shared memory according to the to-be-read data identifier includes: determining a target data read event of the plurality of data read events; and writing the target data corresponding to the target data reading event into the shared memory according to the to-be-read data identifier carried in the target data reading event.
Specifically, the data reading request may include a request to read a large amount of data, at this time, the large amount of data requested to be read in the data reading request may be grouped into a plurality of data reading events, and each data reading event includes a part of data requested to be read and a to-be-read data identifier corresponding to the part of data.
Based on this, a target data read event among the plurality of data read events is determined, which may be understood as the event currently being executed. The target data corresponding to the to-be-read data identifier carried by the target data read event is written into the shared memory, which completes the writing of the target data of that event. Then a second target data read event is determined among the plurality of data read events, and its second target data is written into the shared memory according to the same steps. These steps are repeated until all the data requested in the plurality of data read events has been written into the shared memory.
It is to be understood that, after the writing of the first target data in the first target data read event into the shared memory is completed, a response message corresponding to the first target data read event is sent to indicate that the writing of the first target data into the shared memory is completed and to indicate the storage address of the first target data, so that the data read module can read the first target data written into the shared memory in time. And then, writing the second target data in the second target data reading event, and correspondingly, after the writing is finished, sending a response message corresponding to the second target data reading event.
It should be noted that, after the data reading module has read the first target data from the shared memory, the second target data overwrites the first target data when the data storage module writes it into the shared memory. This ensures that the shared memory always has enough space to hold the data, spares the data reading module an addressing step, and improves data reading efficiency.
For example, a data read request asks for 9500 pieces of data. Based on the request, the Kudu data storage module divides it into 10 data read events, the first of which, the first target data read event, includes 1000 pieces of first target data. The module writes the first target data into the shared memory according to the to-be-read data identifier of these 1000 pieces, and sends a response message corresponding to the first target data read event. The response message indicates to the MPP worker node that the 1000 pieces of first target data have been written into the shared memory, and it carries their storage addresses, columns 1 to 1000. The same operations are then performed for the second through the tenth target data read events, which completes the writing of the 9500 pieces of data requested in the data read request.
In summary, the data requested in the data read request is divided among a plurality of data read events, and the data of each data read event is written into the shared memory in turn. This reduces the difficulty of writing the data, ensures sufficient space in the shared memory, and improves data reading efficiency.
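The decomposition arithmetic in the example can be checked with a few lines of Python. The function name `split_into_events` and the representation of each data read event as a `(first_index, count)` pair are assumptions of this sketch.

```python
from typing import List, Tuple


def split_into_events(total: int, batch_size: int = 1000) -> List[Tuple[int, int]]:
    """Split `total` pieces of data into events of at most `batch_size` pieces.

    Each event is returned as (first_index, count).
    """
    return [
        (first, min(batch_size, total - first))
        for first in range(0, total, batch_size)
    ]


events = split_into_events(9500)
event_count = len(events)
last_event = events[-1]
```

For 9500 pieces of data and a batch size of 1000, this yields ten events: nine of 1000 pieces and a final one of 500 pieces.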
Further, after the data in the last target data reading event is written into the shared memory, in order to indicate to the data reading module that the data requested to be read has been completely written, a traversal completion signal is sent to the data reading module, which is specifically implemented as follows.
In the case where the data corresponding to all of the data reading events has been written into the shared memory, a response message for the data reading request is sent, where the response message carries a traversal completion signal.
Specifically, the traversal completion signal is used to indicate that all data requested to be read in the data read request has been written into the shared memory.
Based on this, under the condition that the data requested to be read in the data reading request is completely written into the shared memory, the response message sent to the data reading module carries the traversal completion signal to indicate that the data requested to be read by the data reading module is completely written into the shared memory, and the data reading request is completed.
Continuing the above example, in the last target data reading event the Kudu data storage module writes the remaining 500 pieces of target data. The response message for this event carries a traversal completion signal indicating that all 9500 pieces of data have been written into the shared memory. Based on the traversal completion signal, the data reading module reads the 500 pieces of data from the shared memory for the last time, after which the data reading request is completed.
In summary, sending the data reading module a traversal completion signal, which indicates that all of the requested data has been written, ends the reading operations of the data reading module; this prevents a reading failure or reading error in the data reading module, reduces the reading error rate, and improves the reading efficiency.
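The role of the traversal completion signal can be illustrated with a small single-process sketch. The `Response` fields, the function names, and the use of a plain Python list in place of the shared memory are all assumptions made for illustration; the specification only says that the response message carries a storage address and, for the last event, a traversal completion signal.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Response:
    """Hypothetical response message for one data read event."""
    offset: int           # where the batch starts in the shared region
    count: int            # number of records written for this event
    traversal_done: bool  # True only for the last event of the request


def storage_side(batches: list[list[int]], shared: list[int]) -> Iterator[Response]:
    """Write each batch into the shared region, then announce it with a response."""
    for i, batch in enumerate(batches):
        shared[:len(batch)] = batch  # the same region is reused for every batch
        yield Response(0, len(batch), traversal_done=(i == len(batches) - 1))


def reader_side(responses: Iterable[Response], shared: list[int]) -> tuple[list[int], str]:
    """Read one batch per response; finish once the completion signal arrives."""
    received: list[int] = []
    state = "RUNNING"
    for resp in responses:
        received.extend(shared[resp.offset:resp.offset + resp.count])
        if resp.traversal_done:
            state = "COMPLETED"
            break
    return received, state


data = list(range(9500))
batches = [data[i:i + 1000] for i in range(0, len(data), 1000)]
shared_region = [0] * 1000
received, state = reader_side(storage_side(batches, shared_region), shared_region)
```

Because `storage_side` is a generator, each batch is written only when the reader asks for the next response, which stands in for the request and response handshake between the two modules. Without the `traversal_done` flag the reader would have no way to tell the final 500-record batch from an intermediate one.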
Further, in order to improve the efficiency of processing and analyzing big data, when the data corresponding to the to-be-read data identifier is written into the shared memory, the data can be written in a columnar manner, which is specifically implemented as follows.
Determining the storage position relation of the data corresponding to the data identifier to be read; and writing the data corresponding to the data identifier to be read into the shared memory in columns based on the storage position relation.
Specifically, the storage position relation of the data can be understood as the storage position of the data in the data storage module, for example, the row and the column in which the data is stored.
Based on the above, the storage position of the data corresponding to the data identifier to be read is found from the data stored in the data storage module, and the data corresponding to the data identifier to be read is written into the shared memory in columns according to the storage position.
For example, data 1 to 20 corresponding to the data identifier to be read are stored in the Kudu data storage module from row 1, column 1 to row 5, column 4. The data are then written into the shared memory column by column: row 1, column 1 to row 5, column 1; then row 1, column 2 to row 5, column 2; and so on, up to row 1, column 4 to row 5, column 4.
In summary, by writing data into the shared memory in columns, data analysis can be accelerated by hardware instructions without losing the access efficiency of row-wise data traversal, and the data in the shared memory is convenient to index and compress, so that the data reading efficiency is improved.
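The write order in the example can be made concrete with a short sketch. It assumes, purely for illustration, that data 1 to 20 occupy a table of 5 rows and 4 columns filled in row order; the function name `column_major` is likewise hypothetical.

```python
def column_major(table: list[list[int]]) -> list[int]:
    """Flatten a row-oriented table into column-by-column write order."""
    n_rows = len(table)
    n_cols = len(table[0]) if table else 0
    return [table[r][c] for c in range(n_cols) for r in range(n_rows)]


# 5 rows x 4 columns; the cell at (row r, column c), counted from 0, holds r * 4 + c + 1.
table = [[r * 4 + c + 1 for c in range(4)] for r in range(5)]
write_order = column_major(table)
print(write_order[:5])  # first column, rows 1 to 5: [1, 5, 9, 13, 17]
```

All values of one column end up adjacent in the output, which is the property that makes columnar data convenient to index, compress, and scan.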
To sum up, one embodiment of the present specification provides a data processing method, which receives a data reading request, where the data reading request carries a to-be-read data identifier, and the to-be-read data identifier may be used to identify data to be read, so as to quickly determine the data to be read, and further improve the efficiency of writing the data into a shared memory; then, according to the data identifier to be read, writing the data corresponding to the data identifier to be read into the shared memory, and writing the data requested to be read in the data reading request into the shared memory, so that a data reading module can directly access the shared memory conveniently, thereby obtaining the data from the shared memory, avoiding the serialization and deserialization operations among the processes, and saving a large amount of computing resources; and then, sending a response message aiming at the data reading request, wherein the response message is used for indicating that the data corresponding to the to-be-read data identifier is written into the shared memory completely, and indicating the data reading module by sending the response message, so that the data reading module can timely and quickly access the shared memory and read the data, and the data reading efficiency is improved.
Corresponding to the above data processing method embodiment, the present specification further provides a data processing apparatus embodiment, and fig. 4 shows a schematic structural diagram of a data processing apparatus provided in an embodiment of the present specification. As shown in fig. 4, the apparatus includes: a receiving submodule 402 configured to receive a data reading request, where the data reading request carries an identifier of data to be read; a writing sub-module 404 configured to write data corresponding to the to-be-read data identifier into a shared memory according to the to-be-read data identifier; the sending submodule 406 is configured to send a response message for the data reading request, where the response message is used to indicate that writing of data corresponding to the to-be-read data identifier into the shared memory is completed.
In an alternative embodiment, the write submodule 404 is further configured to: decomposing the data reading request into a plurality of data reading events, wherein each data reading event carries a data identifier to be read; the writing the data corresponding to the to-be-read data identifier into a shared memory according to the to-be-read data identifier includes: determining a target data read event of the plurality of data read events; and writing the target data corresponding to the target data reading event into the shared memory according to the to-be-read data identifier carried in the target data reading event.
In an optional embodiment, the sending submodule 406 is further configured to: and sending a response message aiming at the data reading request under the condition that all the data corresponding to the data reading events are written into the shared memory, wherein the response message carries a traversal completion signal.
In an alternative embodiment, the write submodule 404 is further configured to: determining the storage position relation of the data corresponding to the data identifier to be read; and writing the data corresponding to the data identifier to be read into the shared memory in columns based on the storage position relation.
In an optional embodiment, the data processing apparatus further comprises a creating sub-module 406, the creating sub-module 406 is configured to: registering a shared memory identifier, the shared memory identifier pointing to a process address space in a virtual memory; and taking the process address space as the shared memory.
To sum up, in the data processing apparatus provided in an embodiment of the present specification, the receiving sub-module 402 is configured to receive a data reading request, where the data reading request carries a to-be-read data identifier, and the to-be-read data identifier may be used to identify data that needs to be read, so as to quickly determine the data that needs to be read, and further improve the efficiency of writing the data into the shared memory; the write-in sub-module 404 is configured to write data corresponding to the to-be-read data identifier into the shared memory according to the to-be-read data identifier, and write data requested to be read in the data read request into the shared memory, so that the data read module can directly access the shared memory to obtain data from the shared memory, thereby avoiding serialization and deserialization operations among processes and saving a large amount of computing resources; the sending sub-module 406 is configured to send a response message for the data reading request, where the response message is used to indicate that the data corresponding to the to-be-read data identifier is written into the shared memory, and the data reading module is indicated by sending the response message, so that the data reading module can access the shared memory to read data quickly in time, and the data reading efficiency is improved.
The above is a schematic configuration of a data processing apparatus of the present embodiment. It should be noted that the technical solution of the data processing apparatus and the technical solution of the data processing method belong to the same concept, and details that are not described in detail in the technical solution of the data processing apparatus can be referred to the description of the technical solution of the data processing method.
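As a rough illustration of registering a shared memory identifier and attaching to it from the reading side, the following sketch uses Python's standard `multiprocessing.shared_memory` module. This is a stand-in chosen for brevity, not the mechanism used by the embodiment; the segment name, the segment size, and the helper names are assumptions, and both sides run in one process here only to keep the example self-contained.

```python
import os
from multiprocessing import shared_memory

# Hypothetical identifier; in the described scheme it would be pre-configured
# on both the storage side and the reading side.
SHM_NAME = f"shm_demo_{os.getpid()}"


def register(name: str, size: int) -> shared_memory.SharedMemory:
    """Storage side: create the named segment and map it into this process."""
    return shared_memory.SharedMemory(name=name, create=True, size=size)


def attach(name: str) -> shared_memory.SharedMemory:
    """Reading side: map the already registered segment by its name."""
    return shared_memory.SharedMemory(name=name)


def demo() -> bytes:
    writer = register(SHM_NAME, 16)
    try:
        writer.buf[:5] = b"hello"          # write through the writer's mapping
        reader = attach(SHM_NAME)
        try:
            return bytes(reader.buf[:5])   # read through the reader's mapping
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()                    # remove the segment once it is no longer needed


result = demo()
```

The point of the sketch is that the reader only needs the agreed name to reach the same bytes; no payload is copied through a socket or serialized between the two sides.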
Corresponding to the above data reading method and data writing method, an embodiment of the present specification further provides a data processing system, fig. 5 shows a schematic structural diagram of a data processing system 500 provided according to an embodiment of the present specification, and referring to fig. 5, the data processing system 500 includes: a data reading module 502 and a data storage module 504; the data reading module 502 is configured to receive a data reading task carrying an identifier of data to be read, generate a data reading request according to the identifier of the data to be read, and send the data reading request to the data storage module; the data storage module 504 is configured to receive the data read request carrying the to-be-read data identifier, respond to the data read request, write data corresponding to the to-be-read data identifier into a shared memory, and send a response message to the data read request to the data read module, where the response message is used to indicate that writing of data corresponding to the to-be-read data identifier into the shared memory is completed; the data reading module 502 is further configured to receive the response message, and read data from the shared memory according to the response message.
It is understood that the data reading module 502 can be used to execute the data reading method and the data storing module 504 can be used to execute the data writing method.
In an optional embodiment, the data reading module 502 is further configured to: and receiving a response message aiming at the data reading request under the condition that the data reading task comprises a plurality of data reading events, wherein the response message is used for indicating that the writing of target data corresponding to a target data reading event in the plurality of data reading events into the shared memory is completed.
In an optional embodiment, the data reading module 502 is further configured to: determining, according to the response message, the target data that carries the identification of the data to be read and corresponds to the target data reading event; and reading the target data from the shared memory according to the storage address in the response message.
In an optional embodiment, the data reading module 502 is further configured to: and under the condition that the response message carries a traversal completion signal, switching the state of the data reading task to a completion state according to the traversal completion signal.
In an optional embodiment, the data reading module 502 is further configured to: and establishing a data transmission relation with the shared memory, and executing the step of reading data from the shared memory according to the response message.
In an optional embodiment, the data storage module 504 is further configured to: registering a shared memory identifier, the shared memory identifier pointing to a process address space in a virtual memory; and taking the process address space as the shared memory.
In an optional embodiment, the data storage module 504 is further configured to: decomposing the data reading request into a plurality of data reading events, wherein each data reading event carries a data identifier to be read; the writing the data corresponding to the to-be-read data identifier into a shared memory according to the to-be-read data identifier includes: determining a target data read event of the plurality of data read events; and writing the target data corresponding to the target data reading event into the shared memory according to the identification of the data to be read carried in the target data reading event.
In an optional embodiment, the data storage module 504 is further configured to: and sending a response message aiming at the data reading request under the condition that all the data corresponding to the data reading events are written into the shared memory, wherein the response message carries a traversal completion signal.
In an optional embodiment, the data storage module 504 is further configured to: determining the storage position relation of the data corresponding to the data identifier to be read; and writing the data corresponding to the data identifier to be read into the shared memory in columns based on the storage position relation.
Further, in an optional embodiment, since the data reading module sends the data reading request to the data storage module, and the data storage module sends the response message to the data reading module, that is, a communication relationship needs to be established between the data reading module and the data storage module, in order to improve the efficiency of data transmission, the data reading module and the data storage module are in communication connection through a remote procedure call protocol.
For example, the MPP working node and the Kudu data storage module are communicatively connected through a remote procedure call protocol. Because the communication between the MPP working node and the Kudu data storage module is communication between processes, using a remote procedure call protocol can further improve the efficiency of data transmission while preserving the technical architecture of Kudu.
Further, in practical applications, a client usually issues an SQL request to a data processing system, and the data processing system needs to parse the SQL request to determine a data reading task. An embodiment of this specification takes an MPP data processing system as an example, so the data processing system further includes:
the coordination module is used for acquiring a data reading request submitted by a client and parsing the data reading request to determine a parsing result.
Specifically, the data reading request submitted by the client may be understood as an SQL request, and the coordination module may be configured to perform semantic parsing on the SQL request.
And the execution planning module is used for determining a data reading task according to the parsing result.
Specifically, the execution planning module obtains SQL condition predicates, partition information and the like according to semantic analysis, assembles a Kudu data request, and requests a Kudu master node to obtain traversal node information.
And the scheduling task module is used for distributing the data reading task to the data reading module, wherein the data reading task carries a data identifier to be read.
It is understood that a plurality of data reading modules may be included in the data processing system, and the scheduling task module may allocate the data reading tasks to the corresponding data reading modules according to the traversal node information.
Specifically, the scheduling task module allocates each data reading task to the corresponding MPP working node for execution, according to the information, included in the traversal node information, about the node on which each data partition is located.
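The allocation performed by the scheduling task module can be sketched as a lookup from partition location to working node. The dictionaries, the host and worker names, and the rule that a partition is served by the worker on the same host are assumptions for illustration; the specification only states that tasks are allocated according to the node on which each data partition is located.

```python
from collections import defaultdict


def schedule(partition_hosts: dict[str, str],
             worker_by_host: dict[str, str]) -> dict[str, list[str]]:
    """Assign each data partition to the working node on the partition's host."""
    plan: dict[str, list[str]] = defaultdict(list)
    for partition, host in sorted(partition_hosts.items()):
        worker = worker_by_host.get(host)
        if worker is None:
            raise LookupError(f"no working node available on host {host!r}")
        plan[worker].append(partition)
    return dict(plan)


# Hypothetical traversal node information: partition -> host holding it.
partition_hosts = {"p0": "hostA", "p1": "hostB", "p2": "hostA"}
worker_by_host = {"hostA": "worker-1", "hostB": "worker-2"}
plan = schedule(partition_hosts, worker_by_host)
```

Placing the reading task on the host that holds the partition is what allows the working node and the storage node to exchange data through shared memory rather than over the network.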
To sum up, in the data processing system provided in an embodiment of the present specification, the data reading module is configured to receive a data reading task carrying an identifier of data to be read, generate a data reading request according to the identifier of the data to be read, and send the data reading request to the data storage module, so that the data storage module can determine data requested to be read by the data reading module through interaction of the data reading request between the data reading module and the data storage module, so that the data storage module searches for the data requested to be read in its own database, and provides a condition for writing the data requested to be read into the shared memory.
The data storage module is configured to receive the data read request carrying the identifier of the data to be read, write data corresponding to the identifier of the data to be read into a shared memory in response to the data read request, and send a response message to the data read request to the data read module, where the response message is used to indicate that writing of the data corresponding to the identifier of the data to be read into the shared memory is completed, and the response message carries a storage address of the data corresponding to the identifier of the data to be read in the shared memory. Through the interaction of the response messages between the data reading module and the data storage module, the data reading module can timely know that the data requested to be read in the data reading request is written into the shared memory and completed, so that the data can be quickly read, and the data reading efficiency is improved.
The shared memory is arranged between the data reading module and the data storage module to carry the writing and reading of data, so that the data reading module and the data storage module do not need to transfer the data through network communication and are not limited by a network protocol, thereby improving communication efficiency and reducing the CPU computing resources occupied.
The data reading module is further configured to receive the response message, and read data from the shared memory according to the response message, so as to complete the data reading task and achieve data transmission.
The data reading method and the data writing method provided by the present application are further described below with reference to fig. 6 and fig. 7, taking data transmission between an MPP working node and a Kudu data storage node as an example. Fig. 6 is a flowchart of a data reading and writing method provided in an embodiment of the present specification, and fig. 7 is a schematic structural diagram of a data reading and writing method provided in an embodiment of the present specification. The specific steps are as follows.
Step 602, receiving a data reading task, where the data reading task carries a to-be-read data identifier.
Specifically, the MPP working node receives a data reading task issued by the MPP data processing system, and a to-be-read data identifier carried in the data reading task may be used to determine data to be read in the data reading task.
Step 604, generating and sending a data reading request to the Kudu data storage node according to the to-be-read data identifier.
Specifically, as shown in fig. 7, the MPP working node generates a data reading request according to the identifier of the data to be read, and it can be understood that the data reading request carries the identifier of the data to be read, and the data reading request is sent to the Kudu data storage node through a remote procedure call protocol.
Step 606, receive a data read request.
Specifically, the Kudu data storage node receives a data reading request, and determines data requested to be read according to a to-be-read data identifier carried by the data reading request.
Step 608, writing the data corresponding to the to-be-read data identifier into the shared memory according to the data read request.
Specifically, the Kudu data storage node determines the data requested to be read according to the data identifier to be read, and writes the data requested to be read into the shared memory. Referring to fig. 7, after receiving a data read request, the Kudu data storage node writes data into the shared memory by column: it first writes the first type of data "1001, 1002, 1003, 1004"; after all data of the first type has been written into the shared memory, it writes the second type of data "A, B, C, D", where A denotes area A, B denotes area B, C denotes area C, and D denotes area D; and after the second type of data has been written into the shared memory, it writes the third type of data "100.0, 101.0, 120.0, 103.0". The first type of data may be used to represent the index of each area, and the third type of data may be used to represent the evaluation value of each area.
Step 610, sending a response message for the data reading request to the MPP working node.
Specifically, after writing the data requested to be read into the shared memory, the Kudu data storage node sends a response message for the data read request to the MPP working node. The response message indicates that the data requested to be read by the MPP working node has been written into the shared memory; based on the response message, the MPP working node may start reading the data from the shared memory. The response message also carries the storage address, in the shared memory, of the data requested to be read, so that the MPP working node may directly read the data at that storage address.
Step 612, receiving the response message, and reading data from the shared memory according to the response message.
Specifically, the MPP working node receives the response message. Based on the response message, the MPP working node may start to read data from the shared memory, and may directly read the data at the storage address in the shared memory.
Step 614, switching the state of the data reading task to a completion state under the condition that the response message carries a traversal completion signal.
Specifically, the traversal completion signal is used to indicate that all data requested to be read in the data reading request has been written into the shared memory. When receiving a response message carrying the traversal completion signal, the MPP working node reads the data from the shared memory and then switches the state of the data reading task to a completion state, which indicates that all data in the data reading task has been read.
It can be understood that, when the response message does not carry the traversal completion signal, the Kudu data storage node has not written all of the data requested to be read into the shared memory at one time, and will write the data in multiple passes. In this case, steps 608 to 612 are repeated until the Kudu data storage node has written all of the data requested in the data read request into the shared memory and has sent a response message carrying the traversal completion signal to the MPP working node.
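Steps 608 to 612 for the fig. 7 values can be sketched with a byte buffer standing in for the shared memory. The binary layout (little-endian 32-bit integers, single bytes, and 64-bit floats), the column names, and the form of the storage address as an (offset, count) pair are all assumptions for illustration; the specification does not define the layout or the address format.

```python
import struct

# Hypothetical schema: (column name, struct type code for one value).
SCHEMA = [("index", "i"), ("area", "c"), ("value", "d")]


def write_columns(buf: bytearray, columns: dict[str, list]) -> dict[str, tuple[int, int]]:
    """Step 608: write each column contiguously; return {name: (offset, count)}."""
    addresses = {}
    offset = 0
    for name, code in SCHEMA:
        values = columns[name]
        fmt = f"<{len(values)}{code}"          # "<" means little-endian, no padding
        struct.pack_into(fmt, buf, offset, *values)
        addresses[name] = (offset, len(values))
        offset += struct.calcsize(fmt)
    return addresses


def read_columns(buf: bytearray, addresses: dict[str, tuple[int, int]]) -> dict[str, list]:
    """Step 612: read each column back using the addresses from the response."""
    result = {}
    for name, code in SCHEMA:
        offset, count = addresses[name]
        result[name] = list(struct.unpack_from(f"<{count}{code}", buf, offset))
    return result


columns = {
    "index": [1001, 1002, 1003, 1004],
    "area": [b"A", b"B", b"C", b"D"],
    "value": [100.0, 101.0, 120.0, 103.0],
}
shared_buf = bytearray(64)
addresses = write_columns(shared_buf, columns)   # carried in the response message
read_back = read_columns(shared_buf, addresses)
```

Under this assumed layout the three columns start at byte offsets 0, 16, and 20, so the reader can locate every column from the response message alone, without any per-record decoding.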
In summary, with the data reading method and the data writing method, data is read and written through the shared memory, which serves as the channel for data communication between processes, so that the limitation of a network protocol is avoided and the efficiency of data transmission is improved.
FIG. 8 illustrates a block diagram of a computing device 800, according to one embodiment of the present description. The components of the computing device 800 include, but are not limited to, memory 810 and a processor 820. The processor 820 is coupled to the memory 810 via a bus 830, and a database 850 is used to store data.
Computing device 800 also includes access device 840, access device 840 enabling computing device 800 to communicate via one or more networks 860. Examples of such networks include the Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. Access device 840 may include one or more of any type of network interface (e.g., a Network Interface Card (NIC)), whether wired or wireless, such as an IEEE 802.11 Wireless Local Area Network (WLAN) wireless interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the present description, the above-described components of computing device 800, as well as other components not shown in FIG. 8, may also be connected to each other, such as by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 8 is for purposes of example only and is not limiting as to the scope of the description. Those skilled in the art may add or replace other components as desired.
Computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet computer, personal digital assistant, laptop computer, notebook computer, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smart glasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 800 may also be a mobile or stationary server.
Wherein the processor 820 is configured to execute computer-executable instructions that, when executed by the processor, implement the steps of the above-described method.
The foregoing is a schematic diagram of a computing device of the present embodiment. It should be noted that the technical solution of the computing device and the technical solution of the method described above belong to the same concept, and details that are not described in detail in the technical solution of the computing device can be referred to the description of the technical solution of the method described above.
An embodiment of the present specification also provides a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the above-described method.
The above is an illustrative scheme of a computer-readable storage medium of the embodiment. It should be noted that the technical solution of the storage medium belongs to the same concept as the technical solution of the above-mentioned method, and details that are not described in detail in the technical solution of the storage medium can be referred to the description of the technical solution of the above-mentioned method.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), electrical carrier signals, telecommunications signals, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be suitably added to or reduced according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
It should be noted that, for the sake of simplicity, the foregoing method embodiments are described as a series of acts, but those skilled in the art should understand that the present embodiment is not limited by the described order of acts, because some steps may be performed in other sequences or simultaneously according to the present embodiment. Furthermore, those skilled in the art will appreciate that the embodiments described in this specification are preferred embodiments, and that the acts and modules involved are not necessarily all required by the embodiments of this specification.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present specification disclosed above are intended only to aid in the description of the specification. The alternative embodiments do not describe every detail exhaustively, and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the embodiments and the practical application, to thereby enable others skilled in the art to best understand and utilize the embodiments. The specification is limited only by the claims and their full scope and equivalents.

Claims (16)

1. A data processing method for a data reading module, comprising:
receiving a data reading task, wherein the data reading task carries a data identifier to be read;
generating and sending a data reading request to a data storage module according to the to-be-read data identifier;
receiving a response message sent by the data storage module for the data reading request, wherein the response message is used for indicating that the data corresponding to the to-be-read data identifier is called from the data storage module and written into a shared memory in a row, the shared memory is a process address space in which a shared memory identifier obtained by registration points to a virtual memory, and address information of the shared memory is pre-configured in the data reading module;
and reading data from the shared memory according to the response message.
2. The method of claim 1, wherein receiving the response message for the data read request comprises:
and receiving a response message aiming at the data reading request under the condition that the data reading task comprises a plurality of data reading events, wherein the response message is used for indicating that the writing of target data corresponding to a target data reading event in the plurality of data reading events into the shared memory is completed.
3. The method according to claim 2, wherein reading data from the shared memory according to the response message when the response message carries a storage address of data corresponding to the identifier of the data to be read in the shared memory comprises:
determining target data which carries the identification of the data to be read and corresponds to the target data reading event according to the response message;
and reading the target data from the shared memory according to the storage address in the response message.
4. The method according to any one of claims 1-3, wherein after the step of receiving a response message for the data read request, further comprising:
and under the condition that the response message carries a traversal completion signal, switching the state of the data reading task to a completion state according to the traversal completion signal.
5. The method of claim 1, wherein before the step of reading data from the shared memory according to the response message, the method further comprises:
and establishing a data transmission relation with the shared memory, and executing the step of reading data from the shared memory according to the response message.
6. A data processing apparatus, applied to a data reading module, comprising:
a first receiving submodule configured to receive a data reading task, wherein the data reading task carries a to-be-read data identifier;
a sending submodule configured to generate a data reading request according to the to-be-read data identifier and send the data reading request to a data storage module;
a second receiving submodule configured to receive a response message sent by the data storage module for the data reading request, wherein the response message indicates that the data corresponding to the to-be-read data identifier has been retrieved from the data storage module and written into a shared memory row by row, the response message carries the storage address, in the shared memory, of the data corresponding to the to-be-read data identifier, the shared memory is a process address space in virtual memory pointed to by a shared memory identifier obtained by registration, and address information of the shared memory is pre-configured in the data reading module;
and a reading submodule configured to read data from the shared memory according to the response message.
7. A data processing method, applied to a data storage module, comprising:
receiving a data reading request sent by a data reading module, wherein the data reading request carries a to-be-read data identifier;
according to the to-be-read data identifier, retrieving the data corresponding to the to-be-read data identifier from the data storage module and writing the data into a shared memory row by row, wherein the shared memory is a process address space in virtual memory pointed to by a shared memory identifier obtained by registration, and address information of the shared memory is pre-configured in the data reading module;
and sending a response message for the data reading request to the data reading module, wherein the response message indicates that the data corresponding to the to-be-read data identifier has been completely written into the shared memory.
8. The method of claim 7, further comprising, after the step of receiving a data reading request:
decomposing the data reading request into a plurality of data reading events, wherein each data reading event carries a to-be-read data identifier;
wherein writing the data corresponding to the to-be-read data identifier into the shared memory according to the to-be-read data identifier comprises:
determining a target data reading event among the plurality of data reading events;
and writing the target data corresponding to the target data reading event into the shared memory according to the to-be-read data identifier carried in the target data reading event.
9. The method of claim 8, wherein the step of sending a response message for the data reading request further comprises:
sending a response message for the data reading request when the data corresponding to all of the data reading events has been written into the shared memory, wherein the response message carries a traversal completion signal.
10. The method according to claim 7, wherein writing the data corresponding to the to-be-read data identifier into the shared memory comprises:
determining the storage position relation of the data corresponding to the to-be-read data identifier;
and writing the data corresponding to the to-be-read data identifier into the shared memory row by row based on the storage position relation.
11. A data processing apparatus, applied to a data storage module, comprising:
a receiving submodule configured to receive a data reading request sent by a data reading module, wherein the data reading request carries a to-be-read data identifier;
a writing submodule configured to retrieve, according to the to-be-read data identifier, the data corresponding to the to-be-read data identifier from the data storage module and write the data into a shared memory row by row, wherein the shared memory is a process address space in virtual memory pointed to by a shared memory identifier obtained by registration, and address information of the shared memory is pre-configured in the data reading module;
and a sending submodule configured to send a response message to the data reading module, wherein the response message indicates that writing of the data corresponding to the to-be-read data identifier into the shared memory is completed.
12. A data processing system, comprising:
a data reading module and a data storage module;
the data reading module is configured to receive a data reading task carrying a to-be-read data identifier, generate a data reading request according to the to-be-read data identifier, and send the data reading request to the data storage module;
the data storage module is configured to receive the data reading request carrying the to-be-read data identifier, and, in response to the data reading request, retrieve the data corresponding to the to-be-read data identifier from the data storage module, write the data into a shared memory row by row, and send a response message for the data reading request to the data reading module, wherein the response message indicates that writing of the data corresponding to the to-be-read data identifier into the shared memory is completed, the shared memory is a process address space in virtual memory pointed to by a shared memory identifier obtained by registration, and address information of the shared memory is pre-configured in the data reading module;
the data reading module is further configured to receive the response message and read data from the shared memory according to the response message.
13. The system of claim 12, wherein the data reading module and the data storage module are communicatively coupled via a remote procedure call protocol.
14. The system of claim 12, further comprising:
the coordination module is used for acquiring a data reading request submitted by a client and analyzing the data reading request to determine an analysis result;
the execution plan module is used for determining a data reading task according to the analysis result;
and the scheduling task module is used for distributing the data reading task to the data reading module, wherein the data reading task carries the identification of the data to be read.
15. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions, which when executed by the processor implement the steps of the method of any one of claims 1-5, 7-10.
16. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the steps of the method of any one of claims 1-5, 7-10.
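The request, row-by-row write, response, and read flow recited in claims 6, 7 and 12 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class names `StorageModule` and `ReadingModule`, the `Response` fields, the fixed-width `ROW` layout, and the in-memory `table` are all hypothetical, and Python's standard `multiprocessing.shared_memory` stands in for the registered shared memory whose name is pre-configured in the reading module. Both modules run in one process here for brevity; in the claimed system they would be separate modules that may communicate, for example, over a remote procedure call protocol.

```python
import struct
from dataclasses import dataclass
from multiprocessing import shared_memory

# Hypothetical fixed-width row layout: 4-byte key plus 16-byte payload.
# Payloads longer than 16 bytes would be truncated by this layout.
ROW = struct.Struct("<I16s")


@dataclass
class Response:
    """Hypothetical response message carrying the storage address."""
    offset: int               # where the rows start in the shared memory
    length: int               # total number of bytes written
    traversal_complete: bool  # stands in for the traversal completion signal


class StorageModule:
    def __init__(self, shm_name, table):
        # Attach to an already registered shared memory segment by name.
        self.shm = shared_memory.SharedMemory(name=shm_name)
        self.table = table  # assumed shape: data identifier -> list of (key, payload)

    def handle_read_request(self, data_id):
        rows = self.table[data_id]
        offset = 0
        # Write the requested data into the shared memory one row at a time.
        for i, (key, payload) in enumerate(rows):
            ROW.pack_into(self.shm.buf, offset + i * ROW.size, key, payload)
        return Response(offset, len(rows) * ROW.size, True)

    def close(self):
        self.shm.close()


class ReadingModule:
    def __init__(self, shm_name):
        # The segment name plays the role of the pre-configured address information.
        self.shm = shared_memory.SharedMemory(name=shm_name)

    def read(self, response):
        rows = []
        end = response.offset + response.length
        # Read back exactly the region named by the response message.
        for pos in range(response.offset, end, ROW.size):
            key, payload = ROW.unpack_from(self.shm.buf, pos)
            rows.append((key, payload.rstrip(b"\x00")))
        return rows

    def close(self):
        self.shm.close()


def demo():
    # Create ("register") the segment, run one request/response cycle, clean up.
    shm = shared_memory.SharedMemory(create=True, size=4096)
    try:
        storage = StorageModule(shm.name, {"orders": [(1, b"alpha"), (2, b"beta")]})
        reader = ReadingModule(shm.name)
        response = storage.handle_read_request("orders")
        rows = reader.read(response)
        storage.close()
        reader.close()
        return response, rows
    finally:
        shm.close()
        shm.unlink()
```

In this sketch only the small `Response` object travels between the modules, while the rows themselves stay in the shared segment, which is the design point the claims rely on. Calling `demo()` returns the response together with the rows the reader recovered from the segment.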
CN202210599316.1A 2022-05-30 2022-05-30 Data processing method and device Active CN114691051B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210599316.1A CN114691051B (en) 2022-05-30 2022-05-30 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210599316.1A CN114691051B (en) 2022-05-30 2022-05-30 Data processing method and device

Publications (2)

Publication Number Publication Date
CN114691051A (en) 2022-07-01
CN114691051B (en) 2022-10-04

Family

ID=82144446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210599316.1A Active CN114691051B (en) 2022-05-30 2022-05-30 Data processing method and device

Country Status (1)

Country Link
CN (1) CN114691051B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN202067266U (en) * 2011-01-22 2011-12-07 吴敏萍 Maintenance device for reading local internal memory by remote node in distributed type shared internal memory system
CN106331148A (en) * 2016-09-14 2017-01-11 郑州云海信息技术有限公司 Cache management method and cache management device for data reading by clients
CN107491355A (en) * 2017-08-17 2017-12-19 山东浪潮商用系统有限公司 Inter-process function call method and device based on shared memory
CN107992271A (en) * 2017-12-21 2018-05-04 郑州云海信息技术有限公司 Data pre-reading method, device, equipment and computer-readable storage medium
CN108833477A (en) * 2018-05-16 2018-11-16 百度在线网络技术(北京)有限公司 Message transmission method, system and device based on shared memory
CN109522053A (en) * 2017-09-20 2019-03-26 阿里巴巴集团控股有限公司 A kind of massive parallel processing and data processing method
CN114254036A (en) * 2021-11-12 2022-03-29 阿里巴巴(中国)有限公司 Data processing method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2167632A1 (en) * 1995-01-23 1996-07-24 Leonard R. Fishler Apparatus and method for efficient transfer of data and events between processes and between processes and drivers in a parallel, fault tolerant message based operating system
CN110399229A (en) * 2018-04-25 2019-11-01 清华大学 Inter-process communication method, device, system, medium and terminal
CN112860506A (en) * 2019-11-28 2021-05-28 阿里巴巴集团控股有限公司 Monitoring data processing method, device, system and storage medium
CN113939805A (en) * 2020-04-29 2022-01-14 华为技术有限公司 Method and system for interprocess communication
CN113297320A (en) * 2020-07-24 2021-08-24 阿里巴巴集团控股有限公司 Distributed database system and data processing method
CN112099967A (en) * 2020-08-20 2020-12-18 深圳市元征科技股份有限公司 Data transmission method, terminal, device, equipment and medium
CN113297110A (en) * 2021-05-17 2021-08-24 阿里巴巴新加坡控股有限公司 Data acquisition system, method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
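Strategies for improving the efficiency of attribute analysis of exploration big data; 林茂 (Lin Mao) et al.; Information Technology (《信息技术》); 2015-11-25 (No. 11); full text *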

Also Published As

Publication number Publication date
CN114691051A (en) 2022-07-01

Similar Documents

Publication Publication Date Title
US20170364697A1 (en) Data interworking method and data interworking device
US20050071209A1 (en) Binding a workflow engine to a data model
CN111475584B (en) Data processing method, system and device
CN112380180A (en) Data synchronization processing method, device, equipment and storage medium
CN108881227B (en) Operation control method and device of remote whiteboard system and remote whiteboard system
CN108363741B (en) Big data unified interface method, device, equipment and storage medium
CN110706148B (en) Face image processing method, device, equipment and storage medium
CN112905596B (en) Data processing method, device, computer equipment and storage medium
CN110781159A (en) Ceph directory file information reading method and device, server and storage medium
CN114691051B (en) Data processing method and device
CN108696559B (en) Stream processing method and device
CN110515979B (en) Data query method, device, equipment and storage medium
CN110222046B (en) List data processing method, device, server and storage medium
CN110909072B (en) Data table establishment method, device and equipment
CN112445861A (en) Information processing method, device, system and storage medium
CN110781137A (en) Directory reading method and device for distributed system, server and storage medium
CN114153547B (en) Management page display method and device
CN115982230A (en) Cross-data-source query method, system, equipment and storage medium of database
CN115905151A (en) Method, system and device for querying circulation information based on backup log
CN115422270A (en) Information processing method and device
CN114816408A (en) Information processing method and device
CN114637969A (en) Target object authentication method and device
CN113467823A (en) Configuration information acquisition method, device, system and storage medium
CN113268507A (en) Database data reading system, method and device and electronic equipment
CN110019448B (en) Data interaction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant