WO2021258831A1 - Data processing method and system - Google Patents


Info

Publication number
WO2021258831A1
WO2021258831A1 · PCT/CN2021/088588
Authority
WO
WIPO (PCT)
Prior art keywords
read
file
metadata
data
slice
Prior art date
Application number
PCT/CN2021/088588
Other languages
French (fr)
Chinese (zh)
Inventor
朱琦 (Zhu Qi)
崔宝龙 (Cui Baolong)
王俊捷 (Wang Junjie)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021258831A1 publication Critical patent/WO2021258831A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • This application relates to the computer field, and in particular to a data processing method and system.
  • When a computing node performs big data or AI tasks, it needs to load data files from other devices or platforms into its memory, and then complete the relevant computation for those tasks based on the data in memory.
  • The efficiency with which the computing node reads the file is very low; the time needed to load the data file into memory can even exceed the time needed to complete the big data or AI task on that data, seriously affecting the efficiency of big data or AI tasks.
  • This application provides a data processing method and system, which can improve the efficiency of reading files by computing nodes.
  • a data processing method is provided, which is applied to a data processing system.
  • the data processing system includes a computing node and a storage node.
  • The data processing method includes the following steps: the computing node obtains metadata of a file to be read, where the metadata includes the number of rows in the file and the starting position of each slice within it. The computing node then reads each slice concurrently according to the starting positions recorded in the metadata. Finally, in the order of the slices' starting positions, the data of each slice is stored into a memory space that was requested according to the number of rows in the metadata.
  • Since the storage node generates the metadata of the file to be read in advance, when the computing node reads the file it can obtain, from the metadata, the number of rows of the file and the starting position of each slice within it. This makes it possible to apply for memory space once and have multiple threads read the file concurrently, avoiding the waste of resources caused by repeatedly expanding the memory space when the number of rows cannot be determined in advance.
  • Concurrent reading of files greatly improves the speed at which computing nodes read files, and further improves the processing efficiency of big data and AI tasks.
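As a sketch of the scheme above — assuming a hypothetical metadata dict with illustrative `num_rows`, `slice_offsets`, and `file_size` fields, not the patent's actual format — concurrent slice reading with one thread per slice might look like:

```python
from concurrent.futures import ThreadPoolExecutor

def read_file_concurrently(path, metadata):
    """Read a file slice-by-slice in parallel, guided by precomputed metadata.

    `metadata` is a hypothetical dict of the form:
      {"num_rows": int, "slice_offsets": [o0, o1, ...], "file_size": int}
    """
    offsets = metadata["slice_offsets"]
    # The end of each slice is the start of the next (or the end of file).
    bounds = list(zip(offsets, offsets[1:] + [metadata["file_size"]]))

    def read_slice(start, end):
        # Each thread opens its own handle so seeks do not interfere.
        with open(path, "rb") as f:
            f.seek(start)
            return f.read(end - start)

    # One thread per slice; map() returns results in slice order, so the
    # buffer is assembled according to each slice's starting position.
    with ThreadPoolExecutor(max_workers=len(bounds)) as pool:
        parts = list(pool.map(lambda b: read_slice(*b), bounds))
    return b"".join(parts)
```

Because the slice offsets come from the metadata, the destination buffer can be sized once up front rather than grown as data arrives.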
  • The metadata of the file to be read may also include the number of slices. Before concurrently reading the data of each slice according to its starting position in the file, the computing node can create multiple threads according to the number of slices and then call those threads to read the slices concurrently.
  • When a storage node generates metadata, it can determine the number of slices x based on the hardware processing capacity of the computing node. When the computing node reads the metadata, it creates y threads based on the number of slices x and its current processing capacity, and calls the y threads to read the x slices concurrently.
  • the number y of multiple threads may be equal to the number of slices x.
  • In this case, each thread processes one slice, and the y threads can read the file in parallel, achieving an optimal processing state that greatly improves the speed at which the computing node reads the file and further improves the processing efficiency of big data and AI tasks.
  • the number y of multiple threads can be less than the number of slices x.
  • When the number of threads created is less than the number of slices, each thread can first process one slice and then, after finishing, continue with the next slice from the remaining ones until all slices have been read. Alternatively, some threads may process only one slice while others process multiple slices: a thread that needs to process p slices can read directly from the starting position of its current slice to the starting position of the (p+1)-th slice. In this way one thread processes multiple slices, and the file can still be read concurrently even when the number of threads is less than the number of slices.
  • The computing node can thus flexibly decide the number of threads to create according to its current processing capacity. If the number of threads the processor can currently create equals the number of slices, multiple threads can be called to read the slices in parallel, with each thread processing exactly one slice; this is the optimal processing state and greatly improves the efficiency of reading the file. If the number of threads the processor can currently create is lower than the number of slices, the slices can still be read concurrently, with some threads processing multiple slices; this avoids concurrent read failures caused by a heavily loaded computing node with reduced processing capacity. A reduction in the number of threads therefore does not prevent concurrent reading, ensuring the feasibility of the solution.
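The contiguous-run assignment described above — where a thread responsible for p slices reads from its first slice's start to the (p+1)-th slice's start — can be sketched as a simple partitioning helper. The function name and signature are illustrative:

```python
def assign_slices(num_slices, num_threads):
    """Partition x slices among y <= x threads as contiguous runs.

    Returns one (first_slice, last_slice_exclusive) pair per thread.
    A thread assigned p consecutive slices can then read in one pass
    from the start of its first slice to the start of the (p+1)-th.
    """
    base, extra = divmod(num_slices, num_threads)
    runs, start = [], 0
    for i in range(num_threads):
        # Spread any remainder over the first `extra` threads.
        count = base + (1 if i < extra else 0)
        runs.append((start, start + count))
        start += count
    return runs
```

When y == x every run has length one, matching the optimal one-slice-per-thread case; when y < x some runs cover several slices.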
  • The metadata of the file to be read is generated according to the metadata format and the file itself, after the storage node determines the metadata format from the file's data type; different data types correspond to different metadata formats.
  • The storage node parses the file to be read in advance, determines the metadata format according to the file's data type, generates the metadata, and then stores it. When the computing node later reads the file, it can effectively initialize the memory data structure according to the metadata and read the file concurrently, thereby improving reading efficiency.
  • The metadata is highly scalable: it can be further extended and enriched with whatever information is needed to read various types of data, giving the solution provided by this application very broad applicability.
  • the metadata of the file to be read is stored in the file to be read, and the end of the file to be read includes the starting position of the metadata in the file to be read.
  • When the computing node obtains the metadata of the file to be read from the storage node, it can read the starting position of the metadata from the end of the file and then read the metadata from that starting position.
  • The metadata of the file to be read may be stored at the end of the file, with the offset of the metadata header and a check mask written at the very end, the check mask located before the metadata header offset. When the computing node reads the metadata, it sets the read pointer at the end of the file, reads a certain range of content in reverse, and checks whether that range contains the check mask. If it does, the node positions the pointer at the check mask, reads the metadata header offset in the forward direction, sets the read pointer at that offset, and reads the metadata forward from there.
  • In this way the computing node can obtain the starting position of the metadata from the end of the file and read it without the storage node having to allocate additional resources for storing the metadata, which simplifies file management on the storage node and reduces its management burden.
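A minimal sketch of the reader side of this footer scheme, under an assumed layout that is illustrative rather than the patent's exact format — JSON metadata followed by a hypothetical 4-byte mask and an 8-byte offset:

```python
import json
import struct

MAGIC = b"META"  # hypothetical 4-byte check mask

def read_footer_metadata(path):
    """Recover metadata appended to the end of a data file.

    Assumed layout (illustrative):
      [payload][JSON metadata][check mask (4 bytes)][metadata offset (8 bytes)]
    The reader starts from the end of the file: it checks for the mask,
    reads the metadata offset forward, then jumps to the metadata itself.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)                      # move to end of file
        size = f.tell()
        f.seek(size - 12)                 # 12 = mask (4) + offset (8)
        if f.read(4) != MAGIC:            # no mask: no embedded metadata
            return None
        (meta_start,) = struct.unpack("<Q", f.read(8))
        f.seek(meta_start)
        return json.loads(f.read(size - 12 - meta_start))
```

Placing the mask before the offset means a reader scanning backward from the end of the file encounters the mask first, matching the order described above.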
  • the metadata of the file to be read is stored in a designated path of the storage node.
  • the metadata storage location of the file to be read is the same as the storage location of the file to be read.
  • The file to be read and its metadata include a common identifier. The computing node obtains the metadata of the file from the storage node as follows: it obtains the common identifier of the file from the storage node, and then, according to that identifier, obtains the metadata from the designated path or from the storage location of the file.
  • After the storage node sets a common identifier for the file to be read and its metadata, it stores the metadata under the specified path or at the file's storage location. When the computing node reads the metadata, it can use the common identifier to obtain it from the specified path or the file's storage location without modifying the file-reading logic, so the scheme can be applied to more computing nodes.
  • the metadata of the file to be read includes verification information.
  • the verification information is used to verify whether the metadata of the file to be read has changed after being stored in the storage node.
  • Before calling multiple threads to concurrently read the data of each slice according to its starting position in the file, the computing node can use the verification information to verify the metadata. After confirming that no data has been lost or damaged since the metadata was stored on the storage node, it reads the file concurrently according to the metadata.
  • Before the computing node calls multiple threads to concurrently read the data of each slice, the method may further include the following step: the computing node checks, according to the verification information, whether the metadata of the file to be read has changed since it was stored on the storage node. If it has not changed, the computing node calls multiple threads to concurrently read the data of each slice according to each slice's starting position in the file.
  • The verification information may include a check mask, a metadata check value, a file check value, a metadata format version, a file format version, and so on. The check mask is used by the computing node to identify the header of the metadata, so it is usually located at the metadata header.
  • The metadata check value is used by the computing node to determine whether the metadata has changed since it was stored on the storage node; a change indicates that the metadata may be damaged or lost, in which case the computing node can fall back to other common data processing methods in the industry to read the file.
  • The file check value is used by the computing node to determine whether the file has changed since it was stored on the storage node; a change indicates that the file may be damaged or lost, in which case the computing node can return a data processing failure message.
  • the metadata format version is used by the computing node to determine whether it supports reading the data in this format version. If not, the computing node can use other data processing methods commonly used in the industry to read the file to be read.
  • the file format version is used for the computing node to determine whether it supports reading the file of this format version. If it does not support it, the computing node can use other common data processing methods in the industry to read the file to be read.
  • the above verification information may also include more or less content, which is not specifically limited in this application.
  • the method for verifying the above verification information can use verification methods commonly used in the industry, such as hash verification, sha256 verification, etc., which are not specifically limited in this application.
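Taking the sha256 option mentioned above, the metadata check value can be sketched as follows; the function names are illustrative:

```python
import hashlib

def metadata_check_value(meta_bytes: bytes) -> str:
    """Compute a SHA-256 digest to serve as the metadata check value."""
    return hashlib.sha256(meta_bytes).hexdigest()

def metadata_unchanged(meta_bytes: bytes, stored_check_value: str) -> bool:
    """True if the metadata read back matches the check value stored
    alongside it, i.e. the metadata has not changed since storage."""
    return metadata_check_value(meta_bytes) == stored_check_value
```

The same pattern applies to the file check value: the storage node records the digest when storing, and the computing node recomputes and compares before trusting the data.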
  • Before the computing node calls multiple threads to concurrently read the file based on the metadata, it can first read the verification information in the metadata header to determine whether the metadata has changed since it was stored on the storage node, and use the metadata to read the file only if no change has occurred. This avoids the computing node reading the file according to incorrect metadata after a metadata change, and improves the feasibility of the solution provided by this application.
  • the metadata of the file to be read also includes the data type.
  • The metadata also includes the feature value type, which the computing node uses to initialize the data structure of the memory space.
  • Before the computing node calls multiple threads according to the starting position of each slice in the file and concurrently reads the data of each slice, the method may also include the following step: the computing node initializes the data structure of the memory space according to the data type.
  • The computing node can initialize the memory data structure according to the feature value type in the metadata, ensuring that reading the file will not fail because of an incorrect memory data structure and improving reading efficiency.
  • The metadata of the file to be read also includes the number of values, which is used to apply for the first memory space for storing data values and data column indexes.
  • Before the computing node calls multiple threads according to the starting position of each slice and reads the slices concurrently, the method also includes the following steps: the computing node applies for a first memory space for storing data values and data column indexes according to the number of values, applies for a second memory space for storing row data according to the number of rows, and obtains the overall memory space from the first and second memory spaces.
  • The computing node can apply for memory space according to the number of values and the number of rows in the metadata, ensuring that a file whose data type is a sparse matrix can obtain its memory space in one application without repeated expansions, which avoids wasting resources and improves reading efficiency.
  • The starting position of each slice in the file to be read includes the starting position of the slice's data column index, the starting position of its data values, and the starting position of its row data amounts. The computing node stores the data of each slice into the memory space in order: according to the order of the starting positions of the data column indexes and data values, it stores each slice's column indexes and values into the first memory space; according to the order of the starting positions of the row data amounts, it stores each slice's row data amounts into the second memory space.
  • According to the starting positions of each slice's data column index, data values, and row data amounts, the computing node reads the three groups of data of the sparse matrix, ensuring that a file whose data type is a sparse matrix can also be read concurrently and improving reading efficiency.
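The one-shot allocation for a sparse-matrix file can be sketched as below; the metadata field names `num_values` and `num_rows` are illustrative, and the three arrays mirror a CSR-style layout (values, column indexes, per-row data amounts):

```python
from array import array

def allocate_sparse_buffers(metadata):
    """One-shot allocation for a sparse-matrix file (a sketch).

    The first memory space holds the data values and their column
    indexes; the second holds the per-row data amounts, mirroring the
    two memory spaces described above.
    """
    n_vals = metadata["num_values"]
    n_rows = metadata["num_rows"]
    values = array("d", [0.0] * n_vals)      # data values
    col_index = array("q", [0] * n_vals)     # data column indexes
    row_counts = array("q", [0] * n_rows)    # row data amounts
    return values, col_index, row_counts
```

Because both counts come from the metadata, all three buffers are sized exactly once; each thread can then write its slice's portion of each array at the positions given by the slice's three starting offsets.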
  • the data processing system includes a computing node and a storage node.
  • The above data processing method includes the following steps: the storage node obtains the file to be read and parses it to obtain the metadata of the file, where the metadata includes the number of slices, the number of rows, and the starting position of each slice in the file.
  • the number of rows is used by the computing node to apply for memory space for storing the file to be read
  • the number of slices is used for the computing node to create multiple threads
  • The starting position of each slice in the file to be read is used by the computing node to call multiple threads, read the data of each slice concurrently, and store the data of each slice in the memory space in the order of the slices' starting positions.
  • The storage node stores the metadata of the file to be read.
  • Since the storage node generates the metadata of the file to be read in advance, when the computing node reads the file it can determine, from the metadata, information such as the number of rows, the number of slices, and each slice's starting position in the file. This makes it possible to apply for memory space once and read the file concurrently with multiple threads, which not only avoids data processing failures caused by incorrectly initializing the memory data structure when the data type cannot be determined, but also avoids the waste of resources caused by repeatedly expanding the memory space when the number of rows cannot be determined. With the file read concurrently, the speed at which the computing node reads files is greatly improved, further improving the processing efficiency of big data and AI tasks.
  • The specific process for the storage node to obtain the metadata may be as follows: the storage node parses the file to be read and determines its data type; it then determines the metadata format from the data type, where different data types correspond to different metadata formats; finally, it generates the metadata according to the metadata format and the file itself.
  • The storage node parses the file to be read in advance, determines the metadata format according to the file's data type, generates the metadata, and then stores it. When the computing node later reads the file, it can effectively initialize the memory data structure according to the metadata and read the file concurrently, thereby improving reading efficiency.
  • The metadata is highly scalable: it can be further extended and enriched with whatever information is needed to read various types of data, giving the solution provided by this application very broad applicability.
  • The specific steps for the storage node to store the metadata may be as follows: the storage node stores the metadata within the file to be read, and the end of the file records the starting position of the metadata, so that the computing node can obtain that starting position from the end of the file and then read the metadata from it.
  • The metadata of the file to be read can be stored at the end of the file, with the offset of the metadata header and a check mask written at the very end, the check mask located before the metadata header offset. When the computing node reads the metadata, it sets the read pointer at the end of the file, reads a certain range of content in reverse, and checks whether that range contains the check mask. If it does, the node positions the pointer at the check mask, reads the metadata header offset in the forward direction, sets the read pointer at that offset, and reads the metadata forward from there.
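The storage-node side of this scheme can be sketched as below — again under an assumed, illustrative layout of JSON metadata followed by a hypothetical 4-byte mask and an 8-byte offset, written so that a reader scanning backward from the end of the file finds the mask before the offset:

```python
import json
import struct

MAGIC = b"META"  # hypothetical 4-byte check mask

def append_footer_metadata(path, metadata):
    """Append JSON metadata to an existing file, followed by the check
    mask and the 8-byte offset of the metadata header.

    Returns the offset at which the metadata starts, for reference.
    """
    with open(path, "ab") as f:
        meta_start = f.seek(0, 2)         # current end of file
        f.write(json.dumps(metadata).encode())
        f.write(MAGIC)
        f.write(struct.pack("<Q", meta_start))
    return meta_start
```

Appending keeps the original payload untouched, so the file can still be read by consumers that ignore the footer entirely.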
  • In this way the computing node can obtain the starting position of the metadata from the end of the file and read it without the storage node having to allocate additional resources for storing the metadata, which simplifies file management on the storage node and reduces its management burden.
  • the specific steps of the storage node storing the metadata of the file to be read may be as follows: the storage node stores the metadata of the file to be read in a designated path of the storage node.
  • the specific steps of the storage node storing the metadata of the file to be read may be as follows: the storage node stores the metadata of the file to be read in the storage location of the file to be read.
  • The metadata of the file to be read and the file itself include a common identifier, which the computing node uses to obtain the metadata from a specified path or from the file's storage location.
  • After the storage node sets a common identifier for the file to be read and its metadata, it stores the metadata under the specified path or at the file's storage location. When the computing node reads the metadata, it can use the common identifier to obtain it from the specified path or the file's storage location without modifying the file-reading logic, so the scheme can be applied to more computing nodes.
  • The metadata storage method can be flexibly chosen according to the application environment, making the data processing methods provided in this application more widely applicable.
  • the metadata of the file to be read includes verification information, and the verification information is used for the computing node to verify whether the metadata of the file to be read has changed after being stored in the storage node.
  • The verification information may include a check mask, a metadata check value, a file check value, a metadata format version, a file format version, and so on. The check mask is used by the computing node to identify the header of the metadata, so it is usually located at the metadata header.
  • The metadata check value is used by the computing node to determine whether the metadata has changed since it was stored on the storage node; a change indicates that the metadata may be damaged or lost, in which case the computing node can fall back to other common data processing methods in the industry to read the file.
  • The file check value is used by the computing node to determine whether the file has changed since it was stored on the storage node; a change indicates that the file may be damaged or lost, in which case the computing node can return a data processing failure message.
  • the metadata format version is used by the computing node to determine whether it supports reading the data in this format version. If not, the computing node can use other data processing methods commonly used in the industry to read the file to be read.
  • the file format version is used for the computing node to determine whether it supports reading the file of this format version. If it does not support it, the computing node can use other common data processing methods in the industry to read the file to be read.
  • the above verification information may also include more or less content, which is not specifically limited in this application.
  • the method for verifying the above verification information can use verification methods commonly used in the industry, such as hash verification, sha256 verification, etc., which are not specifically limited in this application.
  • The storage node writes the verification information into the metadata header of the file to be read, so that before calling multiple threads to concurrently read the file based on the metadata, the computing node can read the verification information in the metadata header to determine whether the metadata has changed since it was stored on the storage node. The metadata is used to read the file only if no change has occurred, which avoids the computing node reading the file according to incorrect metadata after a metadata change and improves the feasibility of the solution provided by this application.
  • the metadata of the file to be read also includes the data type.
  • The metadata also includes the feature value type, which the computing node uses to initialize the data structure of the memory space.
  • The storage node puts the feature value type into the metadata of the dense matrix, so that the computing node can initialize the memory data structure according to that type, ensuring that reading the file will not fail because of memory data structure errors and improving reading efficiency.
  • The metadata of the file to be read also includes the number of values. The file to be read contains data values, data column indexes, and row data amounts. The number of values is used by the computing node to apply for the first memory space, which stores the data values and data column indexes; the number of rows is used to apply for the second memory space, which stores the row data amounts; and the memory space of the file to be read comprises the first and second memory spaces.
  • The storage node puts the number of values into the metadata of the sparse matrix, and the computing node can apply for memory space according to the number of values and the number of rows in the metadata. This ensures that a file whose data type is a sparse matrix can obtain its memory space in one application without repeated expansions, avoiding wasted resources and improving reading efficiency.
  • The starting position of each slice in the file to be read includes the starting position of the slice's data column index, the starting position of its data values, and the starting position of its row data amounts. According to these positions the computing node reads the three groups of data of the sparse matrix, ensuring that a file whose data type is a sparse matrix can also be read concurrently and improving reading efficiency.
  • A computing node is provided, which includes modules for executing the data processing method in the first aspect or any one of its possible implementation manners.
  • A storage node is provided, which includes modules for executing the data processing method in the second aspect or any one of its possible implementation manners.
  • a data processing system including a computing node and a storage node.
  • the computing node is used to implement the operation steps of the method described in the first aspect or any one of the possible implementations of the first aspect.
  • The storage node is used to implement the operation steps of the method described in the second aspect or any one of its possible implementation manners.
  • A computer program product is provided which, when run on a computer, causes the computer to execute the methods described in the above aspects.
  • a computer-readable storage medium is provided, and instructions are stored in the computer-readable storage medium, which when run on a computer, cause the computer to execute the methods described in the foregoing aspects.
  • FIG. 1 is a schematic diagram of the architecture of a multi-core processor provided by the present application.
  • FIG. 2 is a schematic diagram of the architecture of a data processing system provided by the present application.
  • FIG. 3 is a schematic structural diagram of a data processing system provided by the present application.
  • FIG. 4 is a schematic flowchart of the steps of a data processing method provided by the present application.
  • FIG. 5 and FIG. 6 are schematic diagrams of the metadata format provided by this application.
  • FIG. 7 shows a format of a file to be read containing metadata provided by this application.
  • FIG. 8 is a schematic flowchart of steps of a data processing method provided by the present application.
  • FIG. 9 is a schematic flowchart of another data processing method provided by this application.
  • FIG. 10 is a schematic flowchart of another data processing method provided by this application.
  • FIG. 11 is a schematic flowchart of another data processing method provided by this application.
  • FIG. 12 is a schematic structural diagram of a computing node provided by this application.
  • FIG. 13 is a schematic diagram of the structure of a server provided by the present application.
  • FIG. 14 is a schematic structural diagram of a storage array provided by the present application.
  • Big data: a collection of data that cannot be captured, managed, and processed with conventional software tools within a certain time frame.
  • the strategic significance of big data technology lies in the professional processing of massive amounts of data.
  • the processed data can be applied to various industries, including finance, automobiles, catering, telecommunications, energy, etc.; for example, unmanned cars that use big data technology and Internet of Things technology, using big data technology to analyze customer behavior for product recommendation, using big data technology to realize credit risk analysis, and so on.
  • Artificial intelligence (AI): theories, methods, technologies, and application systems that use digital computers or computing nodes controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • the application scenarios of artificial intelligence are very wide, such as face recognition, vehicle recognition, pedestrian re-recognition, data processing applications, and so on.
  • the underlying model of AI is a collection of mathematical methods for achieving AI. A large number of samples can be used to train an AI model so that the trained AI model obtains the ability to predict. The samples used to train the AI model can be obtained from a big data platform.
  • Concurrency: two or more events occurring within the same period of time.
  • concurrency refers to multiple threads operating the same resource to process the same or different tasks in a period of time. It should be noted that concurrency includes multiple threads operating at the same time (parallel) within a period of time, and also includes multiple threads operating alternately in time-sharing within a period of time.
  • the core of the processor, also called the kernel, is an important part of the processor.
  • the core can be understood as the execution unit of the processor: all tasks of the processor, such as calculation, receiving/storing commands, and data processing, are executed by the core.
  • Thread: the smallest unit that the operating system can schedule for execution.
  • a core corresponds to at least one thread. Through hyper-threading technology, a core can also correspond to two or more threads, that is, multiple threads are running at the same time.
  • Multi-core processor: one or more cores can be deployed in a processor. If the number M of cores deployed in the processor is not less than 2, the processor is called a multi-core processor.
  • the multi-core processor also includes a memory 109 for storing data, such as double data rate synchronous dynamic random access memory (DDR SDRAM).
  • DDR SDRAM double data rate synchronous dynamic random access memory
  • each core and the memory are connected by a bus 110, and each core can access the data in the memory through the shared memory.
  • concurrent processing is the advantage of the multi-core processor, and the multi-core processor can call multiple threads in a specific clock cycle to concurrently process more tasks.
  • Multi-CPU multi-core processor: also known as a multi-chip multi-core processor, this processor contains multiple multi-core processor chips as shown in FIG. 1. The multiple multi-core processor chips are connected through an interconnect structure, and the interconnect structure can be implemented in a variety of ways, such as a bus.
  • Figure 2 is a schematic diagram of the architecture of a big data or AI task processing system.
  • Figure 2 can also be referred to as a schematic diagram of the architecture of a data processing system.
  • the data processing system is used for the computing node to implement the file reading process and for the storage node to implement the file storage process.
  • the system includes a computing node 210, a storage node 220, and a data collection node 230.
  • the processors on the computing node 210 and the storage node 220 are usually the multi-core processor 100 or the multi-CPU multi-core processor shown in FIG. 1.
  • the storage node 220, the data collection node 230, and the computing node 210 are connected through a network, and the network may be a wired network, a wireless network, or a mixture of the two.
  • the computing node 210 and the storage node 220 may be physical servers, such as X86 servers or ARM servers; they may also be virtual machines (VMs) implemented on general physical servers with network functions virtualization (NFV) technology. A virtual machine is a complete software-simulated computer system with complete hardware system functions, running in a fully isolated environment, such as a virtual machine in a cloud data center; this application is not particularly limited.
  • the storage node 220 may also be other storage devices with storage functions, such as a storage array. It should be understood that the computing node and the storage node 220 may be a single physical server or a single virtual machine, and may also constitute a computer cluster, which is not specifically limited in this application.
  • the data collection node 230 can be a hardware device, for example, a physical server or a cluster of physical servers, or software, for example, a data collection system deployed in a server or a virtual machine. The data collection system can collect data stored in other servers, for example, log information in a website server, and can also collect data gathered by other hardware devices. It should be understood that the above examples are only for illustration, and this application is not specifically limited.
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between nodes, modules, etc. shown in the figure does not constitute any limitation.
  • the computing node 210, the storage node 220, and the data collection node 230 in FIG. 2 are all described by taking three independent devices or server clusters as an example.
  • the computing node 210, the storage node 220, and the data collection node 230 may also be the same server cluster or server, or the computing node 210 and the storage node 220 may be the same server cluster or server, etc., which is not specifically limited in this application.
  • the data collection node 230 collects various raw data and sends them to the storage node 220.
  • after the storage node 220 performs data processing on the received raw data, the file to be read is generated and stored in the storage node 220. It should be understood that, since the sources of the raw data are very wide and the data structures are very complex, the storage node 220 needs to "translate" the raw data into a unified format that can be directly read and written by the processor before storage. Data processing may include data cleaning, feature extraction, format conversion, etc., which is not specifically limited in this application.
  • the computing node 210 reads various files to be read from the storage node 220 and loads them into the memory 109 of the computing node 210.
  • the multi-core processor 100 of the computing node 210 completes the related operations of big data or AI tasks according to the data in the memory 109.
  • FIG. 2 illustrates, as an example, the second core 102 completing the AI task and the third core 103 completing the big data task.
  • the multi-core processor 100 can process multiple tasks concurrently, and multiple cores can process multiple tasks in a specific clock cycle; the multiple tasks may be the same AI task, the same big data task, or the same data processing task, which is not specifically limited in this application.
  • the data collection node 230 is a cloud server deployed with specific services (for example, Kafka and/or Flume), where Kafka is used to provide a high-throughput and highly scalable distributed message queue service, and Flume is a distributed service for collecting, aggregating, and moving massive amounts of log data.
  • the storage node 220 is a computer cluster deployed with a Hadoop distributed file system (HDFS). The storage node 220 can also be deployed with a data processing system, such as Spark, where Spark is a unified analytics engine for large-scale data processing.
  • the computing node 210 is a computer cluster deployed with Spark-ML, where Spark-ML is used to process machine learning (ML) tasks.
  • the cloud server (data collection node 230) deployed with Kafka and/or Flume can first generate massive amounts of raw data and save the raw data in HDFS (storage node 220). Spark on the storage node 220 can read the raw data and perform data processing on it, such as feature extraction and format conversion, convert the raw data into a data format that can be processed by machine learning or big data tasks, and then generate the file to be read and save it in HDFS.
  • Spark-ML (computing node 210) reads the file to be read from HDFS and loads it into the memory 109.
  • the multi-core processor 100 performs machine learning tasks based on the data in the memory 109, such as k-means clustering (K-means) or linear regression processing.
  • when the computing node 210 performs tasks such as big data and machine learning, it needs to first read the file to be read from the storage node 220 and load the file to be read into the memory 109 of the computing node 210 (step 1 in FIG. 2); the computing node 210 then completes the related operations of the big data or machine learning task according to the data in the memory 109 (step 2 in FIG. 2).
  • This application provides a data processing system 400 as shown in FIG. 3. It should be understood that using the data processing system 400 shown in FIG. 3 to perform data processing in the application scenario shown in FIG. 2 can greatly improve the data processing speed of the computing node 210, and further improve the efficiency of the computing node 210 in processing big data or AI tasks.
  • the data processing system 400 includes a computing node 210 and a storage node 220.
  • the specific form and connection manner of the computing node 210 and the storage node 220 can be implemented with reference to FIG. 1, and details are not repeated here.
  • the storage node 220 includes a metadata generating unit 221, which is used to generate metadata of the file to be read.
  • the metadata records basic information of the file to be read.
  • the basic information includes at least the number of lines of the file to be read, the maximum number of slices, and the starting position of each slice in the file to be read.
  • for example, the maximum number of slices of the file to be read is 3, the number of rows is 9, and the starting position of slice 1 is the first line of the file to be read.
  • the starting position of slice 2 is the 4th line of the file to be read
  • the starting position of slice 3 is the 7th line of the file to be read.
  • the metadata may also include more information, such as the type of feature value, the number of columns, etc., which may be specifically determined according to the data type of the file to be read, which is not specifically limited in this application.
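  • As an illustration only, the basic information described above (number of rows, maximum number of slices, and the starting position of each slice) can be sketched as a small Python structure; the class and field names are hypothetical and do not reflect the application's actual on-disk layout:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SliceInfo:
    offset: int       # byte offset of the slice's first line in the file
    line_number: int  # line number of the slice's first line

@dataclass
class Metadata:
    row_count: int    # total number of rows in the file to be read
    slice_count: int  # maximum number of slices
    slices: List[SliceInfo]  # starting position of each slice

# The 9-row, 3-slice example above: slices begin at lines 1, 4 and 7
# (the byte offsets 0/120/240 are made up for the illustration).
meta = Metadata(row_count=9, slice_count=3,
                slices=[SliceInfo(0, 1), SliceInfo(120, 4), SliceInfo(240, 7)])
assert meta.slice_count == len(meta.slices)
```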
  • the metadata generating unit 221 only records the maximum number of slices of the file to be read and the starting position of each slice in the file to be read; the file to be read is not actually sliced and is stored completely, in an unsliced state, in the storage node 220.
  • the metadata can be stored in the storage node together with the file to be read in the form of a separate file, or it can be integrated into the file to be read and stored in the storage node.
  • the specific storage process of the metadata will be described in step S520 of the embodiment of FIG. 4 below.
  • the metadata generating unit 221 may generate corresponding metadata based on the raw data when the storage node 220 receives the raw data, or generate corresponding metadata for the processed data after the storage node 220 performs data processing on the raw data (such as the aforementioned data cleaning, feature extraction, and format conversion) but before the file to be read is generated. After the storage node 220 has generated the file to be read, it can also generate the corresponding metadata according to the file to be read. This application does not limit the input data of the metadata generating unit 221.
  • the computing node 210 includes a metadata reading unit 211 and a slice reading unit 212.
  • the metadata reading unit 211 is used to read the metadata of the file to be read
  • the slice reading unit 212 is used to determine, according to the metadata, the number y of threads for concurrent reading, apply for memory space for the file to be read, and send a data read request to each of the y threads.
  • for example, if the number of slices is 3, the number of threads y can be 1, 2, or 3.
  • Each data read request carries the starting position of a slice in the file to be read and the address of the previously applied memory space.
  • the data read request received by thread 1 carries the starting position of slice 1 in the file to be read
  • the data read request received by thread 2 carries the starting position of slice 2 in the file to be read
  • the data reading request received by thread 3 carries the starting position of slice 3 in the file to be read.
  • the y threads concurrently read the slices of the file to be read according to the starting positions of the received slices, and write the read slices into the above memory space in the order of the starting position of each slice in the file to be read.
  • FIG. 3 takes one core corresponding to one thread as an example (in FIG. 3, core 1 corresponds to thread 1, core 2 corresponds to thread 2, and core 3 corresponds to thread 3). In other embodiments, core 1 may correspond to thread 1 and thread 2 while core 2 corresponds to thread 3, or core 1 may correspond to threads 1 to 3, and so on, so as to achieve the purpose of multiple cores reading files concurrently, improve resource utilization, and improve data processing efficiency.
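  • The concurrent reading described above, where y threads write disjoint slices into one memory space applied for in advance, can be sketched as follows; this is a minimal illustration using Python threads over an in-memory stand-in for the file, and all names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def read_slices_concurrently(lines, slice_starts, row_count):
    """Read every slice into one pre-allocated buffer, one thread per slice.

    `lines` stands in for the file to be read; `slice_starts` holds the
    1-based starting line of each slice, as recorded in the metadata.
    """
    buffer = [None] * row_count   # memory space applied for once, in advance
    bounds = list(slice_starts) + [row_count + 1]

    def read_one(i):
        start, end = bounds[i], bounds[i + 1]  # slice i covers lines [start, end)
        buffer[start - 1:end - 1] = lines[start - 1:end - 1]

    with ThreadPoolExecutor(max_workers=len(slice_starts)) as pool:
        list(pool.map(read_one, range(len(slice_starts))))
    return buffer

data = [f"line{i}" for i in range(1, 10)]   # a 9-line file, as in the example above
result = read_slices_concurrently(data, [1, 4, 7], 9)
assert result == data
```

Because every thread writes a disjoint region of the single buffer, the file lands in memory in slice order without any per-thread memory applications or extra copies, which mirrors the resource-saving point made above.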
  • the data collection node 230 is a cloud server deployed with Kafka and/or Flume
  • the storage node 220 is a computer cluster deployed with HDFS and Spark
  • the computing node 210 is a computer cluster deployed with Spark-ML.
  • the above-mentioned metadata generating unit 221 may be deployed in Spark
  • the metadata reading unit 211 and the slice reading unit 212 may be deployed in Spark-ML.
  • the cloud server (data collection node 230) deployed with Kafka and/or Flume can first generate a large amount of raw data and save the raw data in HDFS (storage node 220). Spark on the storage node 220 can first read the raw data and perform data processing on it, such as feature extraction and format conversion, then generate the file to be read and the corresponding metadata based on the processed data, and then store the file to be read and the corresponding metadata in HDFS. Finally, when Spark-ML (computing node 210) reads the file to be read from HDFS, it first reads the metadata of the file to be read, and then applies for a contiguous segment of memory space based on the information in the metadata.
  • multiple threads are then called to concurrently read the file to be read and load it into the previously requested memory space, after which the machine learning task is performed based on the data in the memory 109.
  • when the computing node 210 reads the file to be read, it can not only read concurrently, but also avoid the resource waste caused by applying for memory multiple times and copying data multiple times, which greatly improves the efficiency of data processing.
  • before the metadata reading unit 211 reads the metadata, it determines whether the file to be read has corresponding metadata. If the file to be read does not have metadata, it can notify the slice reading unit 212 to read the file to be read in a single thread according to the current data processing method in the industry, which is not limited in this application.
  • the storage node 220 in the system generates the metadata of the file to be read before the computing node 210 reads the file to be read, so that the computing node 210 can call multiple threads to read the file concurrently. This not only avoids the problems of incorrect initialization of the memory-space data structure and failed data processing caused by the inability to determine the data type, but also avoids the resource waste caused by expanding the memory space multiple times because the number of lines of the file to be read cannot be determined. Moreover, the ability to read files concurrently greatly improves the speed at which the computing node 210 reads files, and further improves the processing efficiency of big data and AI tasks.
  • before the computing node 210 reads the file to be read, the storage node 220 needs to generate corresponding metadata according to the file to be read, and then store the file to be read and the corresponding metadata in the storage node 220. Therefore, the data processing method provided in this application is first described in detail below with reference to FIG. 4.
  • the specific process of generating metadata by the storage node 220 may include the following steps:
  • S510: Obtain the file to be read from the data collection node 230, and parse the file to be read to obtain the metadata of the file to be read.
  • this application provides a variety of metadata formats to adapt to various application scenarios.
  • the storage node can first determine the data type of the file to be read, and then determine the metadata format of the file to be read according to that data type; files to be read of different data types have different metadata formats. Finally, the metadata of the file to be read is generated according to the metadata format and the parsing result of the file to be read.
  • the metadata records the basic information of the file to be read.
  • the basic information includes at least the number of lines of the file to be read, the maximum number of slices, and the starting position of each slice in the file to be read. Therefore, the format of the metadata may be as shown in FIG. 5, where the format of the metadata includes at least basic information 610, and the basic information 610 includes:
  • (1) The number of rows: used to identify the total number of rows contained in the file to be read, so that the computing node 210 can apply for memory space for storing the file to be read.
  • (2) The number of slices: used to identify the number of slices contained in each file to be read, so that the computing node 210 can call multiple threads to concurrently read the file to be read.
  • the number of slices is usually the maximum number of slices of the file to be read, and the maximum number of slices is an empirical value. It is understandable that if the number of slices of the file to be read is too large, the metadata length of the file to be read will be too large, which will reduce the speed at which the computing node 210 reads the metadata; if the number of slices of the file to be read is too small, part of the cores will remain idle when the computing node 210 concurrently reads the file to be read, which causes a waste of resources. Therefore, the maximum number of slices of the file to be read can be determined according to the number of cores of the computing node 210; for example, the maximum number of slices is equal to the number of processor cores of the computing node 210, or the maximum number of slices is proportional to the number of processor cores. This application makes no specific limitation.
  • (3) The starting position of each slice: used by the threads to read the file to be read concurrently. Each thread can read one slice of the file to be read according to the starting position of that slice in the file to be read and put it into the previously requested memory space, thereby completing the concurrent reading of the file to be read and improving reading efficiency.
  • optionally, the starting position of each slice can be the offset value and line number of the starting position of that slice in the file to be read; each thread can determine the length l of its slice based on the slice's line number and the line number of the next slice's starting position, then set the read pointer to the offset value and read the slice of length l.
  • the starting position of each slice can also include more or less content; for example, it may include only the offset value of the starting position of each slice in the file to be read, or it may additionally include the length of each slice, which is not limited in this application.
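  • As a sketch of the offset-based variant just described, a thread that receives a slice's byte offset can derive the slice length l from the next slice's offset, set the read pointer, and read; the helper below is illustrative only:

```python
import os
import tempfile

def read_slice(path, offset, next_offset):
    # The slice length l is the distance between consecutive slice start
    # offsets recorded in the metadata; set the read pointer to the offset
    # and read l bytes.
    length = next_offset - offset
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Throwaway demo file whose three 6-byte lines start at offsets 0, 6 and 12.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"r1,r1\nr2,r2\nr3,r3\n")
    path = tmp.name
slice1 = read_slice(path, 0, 6)
slice2 = read_slice(path, 6, 12)
os.remove(path)
assert slice1 == b"r1,r1\n"
```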
  • the metadata may also include verification information, which is used to improve the reliability of the metadata.
  • the metadata may also include verification information 620, where the verification information 620 includes:
  • (4) The check mask: used by the computing node 210 to confirm that this is the header of the metadata; therefore, the check mask is located at the header of the metadata.
  • the check mask of the metadata header can be checked first, and this application makes no specific restriction. If the computing node 210 succeeds in verifying the check mask, it proves that the current position of the read pointer is the head of the metadata; the computing node 210 can then start to read the metadata and call multiple threads to concurrently read the file to be read according to the metadata. If the verification fails, the slice reading unit 212 is called to read the file to be read according to the current data processing method in the industry, and this application is not limited to this.
  • optionally, the check mask can be represented by a binary value to speed up processing
  • (5) The metadata check value: used to check whether the content of the metadata information has changed.
  • (6) The file check value: used to check whether the data content of the file to be read has changed.
  • (7) The metadata format version: used to record the format version of the current metadata information.
  • In this way, when the computing node reads the metadata, if it does not support reading metadata information in the latest format, it can still remain compatible with files of the old version;
  • (8) The file format version: used to record the format information of the file currently to be read.
  • when the computing node 210 reads the metadata, it can read the verification information 620 first, and after confirming that the metadata and the data content of the file to be read have not changed and that the version formats are compatible, it can then read the basic information 610 and call multiple threads to read the file to be read concurrently. Therefore, the verification information 620 in the metadata format shown in FIG. 5 is located before the basic information 610. Of course, other methods can also be used to ensure that the computing node reads the verification information 620 before the other metadata information, which is not specifically limited in this application.
  • the verification information (4) to (8) in FIG. 5 are used for illustration, and the metadata may also include more or fewer types of verification information to ensure the reliability of the metadata, which is not specifically limited here.
  • items (4) to (6) above can use verification methods commonly used in the industry, such as hash verification or sha256 verification, which are not specifically limited in this application.
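  • For illustration, a file check value of the kind mentioned above could be computed with sha256; the application names sha256 only as one possible method, so the concrete scheme and names here are assumptions:

```python
import hashlib

def file_check_value(payload: bytes) -> str:
    # Compute a check value over the raw bytes; sha256 is one of the
    # industry-standard methods mentioned above.
    return hashlib.sha256(payload).hexdigest()

content = b"1,2,3\n4,5,6\n"
stored = file_check_value(content)    # the value written into the metadata
# Later, the computing node recomputes the value and compares:
assert file_check_value(content) == stored          # content unchanged
assert file_check_value(content + b"x") != stored   # any change is detected
```

The metadata check value works the same way, applied to the metadata bytes instead of the file content.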
  • the computing node needs different information when reading files to be read of different data types.
  • the data type of the file to be read is usually a dense matrix or a sparse matrix.
  • when the data type of the file to be read is a dense matrix, the computing node 210 needs to initialize the memory data structure according to the string type of the feature value of each column of the dense matrix to ensure that the file to be read will not be wrongly parsed or lost; when the data type of the file to be read is a sparse matrix, the computing node 210 does not need to obtain the feature values of each column of the matrix, but instead needs to apply for memory space for storing the "data values" and "data column indexes" according to the number of values of the sparse matrix.
  • therefore, the metadata formats of different data types will also differ. The following uses the dense matrix data type as an example to describe the metadata format.
  • the metadata may also include type information 630.
  • type information 630 includes:
  • since the computing node 210 executes different reading logic when reading files to be read of different data types (for example, a dense matrix requires additional initialization of the data structure of the memory space), the type information 630 in FIG. 5 is located before the basic information 610. In this way, the computing node 210 first verifies the metadata and the file to be read according to the verification information 620, then determines its reading logic according to the type information 630, and finally calls multiple threads to concurrently read the file to be read according to the basic information 610 and the reading logic. Of course, other methods can also be used to ensure the order of reading the various metadata information, which is not specifically limited in this application.
  • if the data type of the file to be read is different, the metadata format is also different, and the content of the type information 630 is also different. For example, when the data type of the file to be read is a sparse matrix, the type information 630 will not include (10), but will additionally include:
  • (11) The number of values: used to store the number of values of the sparse matrix.
  • the computing node 210 can apply for memory space according to the number of values of the sparse matrix. It should be understood that the storage form of the sparse matrix contains a total of 3 rows of characters, and each data item is saved by these 3 rows: one row of characters represents the "data column index" corresponding to each data item, one row represents the "data value" corresponding to each data item, and one row represents the "row data amount" corresponding to each data item. Therefore, for a sparse matrix, (1) the number of rows is used to apply for the first memory space for storing the "row data amount", and (11) the number of values is used to apply for the second memory space for storing the "data values" and "data column indexes".
  • each thread can read the data column index, data value, and corresponding row data amount of a slice according to the starting position of the three rows of data of that slice, and write the slice in the three-row format of the sparse matrix into the memory space applied for above. Specifically, according to the starting position of the data column index of each slice, the starting position of the data value of each slice, and the starting position of the row data amount of each slice, the computing node 210 can call multiple threads to concurrently read the data value and the data column index of each slice into the second memory space, and to concurrently read the row data amount of each slice into the first memory space, thereby obtaining the file to be read and realizing the purpose of multiple threads concurrently reading multiple slices.
  • optionally, when the computing node 210 reads a file to be read whose data type is a sparse matrix, it may convert the data type of the file to be read from a sparse matrix into a dense matrix before storing it in the memory space. In the conversion process, the computing node 210 needs to know in advance the number of columns of the sparse matrix and the original row number of each data item; the original row number here refers to the row in the original data where each data item was located before the original data was converted into a sparse matrix and stored in the storage node 220. Therefore, the type information 630 may also include (12) the number of columns, and the starting position of the row data amount of each slice may include the offset value of the row data amount of the slice as well as the original row numbers.
  • in this way, each thread can read the data column index, data value, and corresponding row data amount of its slice according to the starting positions of the three rows of data of the slice, and write the slice into the memory space according to the number of rows and columns of the original data, so that multiple threads can concurrently read multiple slices of the sparse matrix and convert the sparse matrix into a dense matrix to be written into the memory space.
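  • As an illustration only, the sparse-to-dense conversion just described can be sketched as follows, assuming the three-row form (data column indexes, data values, row data amounts); the function and parameter names are hypothetical:

```python
def sparse_rows_to_dense(col_index, values, row_counts, num_cols):
    """Rebuild a dense matrix from the three-row sparse form: one row of
    data column indexes, one row of data values, and one row giving how
    many values each original row holds."""
    dense, pos = [], 0
    for count in row_counts:
        row = [0] * num_cols            # an original row, zero-filled
        for j in range(pos, pos + count):
            row[col_index[j]] = values[j]
        dense.append(row)
        pos += count
    return dense

# A 2x3 matrix holding three non-zero values:
dense = sparse_rows_to_dense([0, 2, 1], [5, 7, 9], [2, 1], 3)
assert dense == [[5, 0, 7], [0, 9, 0]]
```

In a concurrent implementation each thread would apply this conversion to its own slice's three rows, using the original row numbers carried in the metadata to place its rows in the shared memory space.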
  • the metadata formats shown in FIG. 5 and FIG. 6 are only used for illustration. The solution provided by this application is not only applicable to the above-mentioned data types (sparse matrix and dense matrix), but also applicable to other data types that can be read item by item or in batches, such as data in Libsvm format, which will not be exemplified one by one here.
  • the metadata of different data types can also include more or less content. Specifically, the content that the metadata needs to contain can be determined according to the information required by the computing node when reading the file to be read, which will not be detailed here.
  • S520: Store the metadata and the file to be read.
  • the storage node 220 stores the metadata in a designated path, or stores the metadata in the storage location of the file to be read, where the metadata of the file to be read and the file to be read contain a common identifier; for example, the file to be read and its metadata have the same file name but different extensions.
  • the storage path of the file to be read (dataA.exp) is /pathA/pathB/.../pathN/dataA.exp, where exp is the general data format of the file to be read, specifically csv, libsvm, etc.
  • the storage path of the metadata (dataA.metadata) of the file to be read is pathA/pathB/.../pathN/dataA.metadata.
  • the computing node 210 when the computing node 210 reads the file to be read, it can directly search for the metadata corresponding to the file to be read that contains the common identifier from the reading path of the file to be read.
  • the storage node 220 may also store the metadata of all files in a specified path.
  • when the computing node 210 reads the file to be read, it may search for the metadata corresponding to the file to be read from the specified path according to the common identifier.
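  • The common-identifier lookup described above (same file name, different extension, optionally under a designated metadata path) can be sketched as below; the `.metadata` extension follows the dataA.metadata example above, while the helper itself and the sample paths are hypothetical:

```python
from pathlib import PurePosixPath

def metadata_path(data_path, metadata_dir=None):
    # Same file name, ".metadata" extension; the metadata sits either next
    # to the data file or under a designated metadata path.
    data = PurePosixPath(data_path)
    base = PurePosixPath(metadata_dir) if metadata_dir else data.parent
    return base / (data.stem + ".metadata")

assert str(metadata_path("/data/dataA.csv")) == "/data/dataA.metadata"
assert str(metadata_path("/data/dataA.csv", "/meta")) == "/meta/dataA.metadata"
```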
  • the storage node 220 may also store the metadata of the file to be read inside the file to be read, with the end of the file containing the starting position of the metadata in the file. When reading the metadata, the computing node 210 can read a certain length of data directly from the end of the file to be read to determine the position of the metadata header in the file (which can be the offset value of the metadata header), and then set the read pointer to that offset value for reading, thereby obtaining the metadata of the file to be read.
  • the metadata is appended to the end of the file to be read, and the format of the file to be read containing the metadata may be as shown in FIG. 7.
•   the metadata is appended to the end of the file to be read, and (13) the check mask and (14) the metadata header offset position are also appended after the metadata.
•   the check mask (13) is generally located before "(14) metadata header offset position", and is used by the computing node 210 to confirm the starting position of (14).
•   the computing node 210 can read a certain range of content in the reverse direction from the end of the file to be read to determine whether that range contains a check mask (13) in the target format; if it does, the node can continue to read (14).
  • the offset position of the metadata header is used for the computing node 210 to determine the position of the metadata header in the file to be read.
  • the offset position of the metadata header may be Line N+1.
•   when the computing node 210 reads the file to be read, it can first set the read pointer to the end of the file, read a certain range of the file tail in a reverse manner, and perform pattern matching to determine whether that range contains a check mask in the target format. If there is no such check mask, the computing node 210 reads the file to be read using a data processing method commonly used in the industry. If the check mask in the target format is present, the node sets the read pointer to the check mask, reads forward to obtain the offset position of the metadata header, sets the read pointer to that offset position, reads the metadata, and then calls multiple threads according to the metadata to read the file to be read concurrently.
•   the check mask can be "#HWBDFORMAT", and the offset position of the metadata header can be #12345678.
•   the computing node 210 can first set the read pointer at the end of the file, then reverse-read the content within a certain range of the file tail to determine whether it contains the fixed format #HWBDFORMAT. If this check mask is present, the node reads the (14) metadata header offset position that follows it, sets the read pointer to the offset position "12345678", and starts reading the metadata.
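The tail scan just described can be sketched as below. The exact tail layout assumed here (the mask immediately followed by "#" and the offset digits) is an illustration of the format in the example above, not a normative encoding:

```python
def find_metadata_offset(path, mask=b"#HWBDFORMAT", tail_range=64):
    """Scan the tail of the file in reverse for the check mask; if found,
    read the metadata header offset that follows it (e.g. "#12345678")
    and return it as an int. Returns None when the mask is absent, in
    which case the caller falls back to an ordinary read."""
    with open(path, "rb") as f:
        f.seek(0, 2)                        # move the read pointer to the end
        size = f.tell()
        f.seek(max(0, size - tail_range))   # reverse-read a bounded tail range
        tail = f.read()
    pos = tail.find(mask)                   # pattern-match the check mask
    if pos < 0:
        return None                         # no mask: not a metadata-bearing file
    after = tail[pos + len(mask):].strip()
    return int(after.lstrip(b"#"))          # "#12345678" -> 12345678
```

A caller would then seek to the returned offset and parse the metadata from there; when `None` comes back, it uses the ordinary industry-standard read path instead.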
•   the metadata storage method can be selected according to the application environment. Storing the metadata under the same file name in the storage path of the file to be read requires no modification to the data processing logic of the computing node and is highly reusable, but it increases the file management burden on the storage node 220. Appending the metadata directly to the end of the file to be read generates no redundant files, which eases file management for the storage node 220, but the data processing logic of the computing node must be modified so that the computing node first reads the metadata from the end of the file and then reads the file to be read based on that metadata.
•   the storage mode of the metadata can be flexibly determined according to the application environment, so that the data processing method provided in this application is more widely applicable.
•   the storage node 220 parses the file to be read in advance, determines the metadata format of the file according to its data type, generates the metadata used for reading the file, and then stores it, so that when the computing node reads the file it can effectively initialize the data structure of the memory space according to the metadata and read the file to be read concurrently, improving the efficiency of file reading.
•   the metadata is highly scalable: it can be further extended and enriched with the information required for reading various types of data, which makes the solution provided by this application broadly applicable.
  • the method for the computing node 210 to read the file to be read will be explained below.
  • the data processing method provided in this application can be applied to the computing node 210 of the data processing system 400 described in FIG. 4, as shown in FIG. 8, the method includes the following steps:
•   S810: the computing node 210 obtains the metadata of the file to be read from the storage node 220, where the metadata includes the number of slices, the number of rows, and the starting position of each slice in the file to be read.
•   the storage node 220 stores the metadata in one of several ways: the metadata of the file to be read is stored in a specified path of the storage node, or the metadata is stored in the same location as the file to be read.
•   step S810 may include the following steps: the computing node 210 obtains the common identifier of the file to be read, such as its file name, from the storage node 220, and then obtains the metadata of the file to be read from the specified path or from the storage location of the file to be read according to that file name.
•   if the metadata file exists, the computing node reads it, applies for memory space based on it, creates threads, and calls the threads to concurrently read the file to be read; if the metadata file does not exist, a data processing method commonly used in the industry is used, which this application does not specifically limit.
•   the storage node 220 generates the file to be read dataA.exp and the corresponding metadata dataA.metadata, that is, the file to be read and its metadata share the same file name as a common identifier, and both are stored in /pathA/pathB/.../pathN. When the computing node 210 reads the file to be read dataA.exp, it can look in the storage path /pathA/pathB/.../pathN of dataA.exp for metadata with the same name as the file to be read, namely dataA.metadata, or check whether the metadata file exists at the path /pathA/pathB/.../pathN/dataA.metadata. If the metadata file exists, the node reads it and reads the file based on the metadata; if it does not exist, a data processing method commonly used in the industry is used, which this application does not specifically limit.
•   step S810 may include the following steps: read the end of the file to obtain the starting position of the metadata in the file to be read, which may specifically be the offset value of the metadata header, and read the metadata according to that offset value.
•   when the computing node 210 reads a file to be read in the format shown in FIG. 7, it can first set the read pointer to the end of the file, reverse-read the content within a certain range of the file tail, and perform pattern matching to determine whether that range contains the (13) check mask of the target format. If the check mask is absent, the computing node 210 reads the file to be read using a data processing method commonly used in the industry. If the (13) check mask of the target format is present, the node reads the (14) metadata header offset position after the check mask, sets the read pointer to that offset position, and then reads the metadata.
•   the computing node 210 can use a data processing method commonly used in the industry, perform data analysis on the file to be read, and return the analysis result to the storage node 220 so that the storage node 220 generates the metadata of the file to be read according to the result. In this way, when another computing node 210 reads the file to be read, the storage node 220 can return the metadata to that computing node 210, so that the computing node concurrently reads the file to be read based on the metadata.
  • S820 The computing node calls multiple threads according to the starting position of each slice in the file to be read, and concurrently reads the data of each slice, where the multiple threads are created by the computing node according to the number of slices.
  • the number of threads y may be equal to the number of slices x.
•   each thread processes a slice, and y threads can read the file to be read in parallel, achieving an optimal processing state that greatly improves the speed at which the computing node reads the file and further improves the processing efficiency of big data and AI tasks.
  • the number of threads y may be less than the number of slices x.
•   the number of slices x of the file to be read is determined according to the hardware processing capability of the computing node 210, but when the computing node 210 reads the file to be read, part of its capacity may currently be occupied by other work, for example an ongoing big data task or AI task, so the number of threads y that the computing node 210 can create may be less than the number of slices x.
•   the computing node 210 can directly create 10 threads and call them to read the slices of the file to be read in parallel, achieving the optimal processing state in which the computing node reads the file fastest and processing efficiency is highest. If 3 cores of the computing node 210 are currently processing big data tasks and only 7 cores are idle, the computing node 210 can create 7 threads G1 to G7 and call them to concurrently read the 10 slices of the file to be read. It should be understood that the above examples are only for illustration, and this application does not make specific limitations.
•   S830: the computing node stores the data of each slice in the memory space according to the order of the starting position of each slice in the file to be read, where the memory space is applied for by the computing node according to the number of rows.
•   the starting position of each slice in the file to be read can be expressed as the offset value and line number of the slice's start in the file. Therefore, after each thread reads the data of its slice, multiple threads can be called to write the slices into the memory space concurrently, in order of the offset values or line numbers of the slice starting positions.
•   one thread can process one slice first; then, after each thread finishes reading its slice, it continues to take the next slice from the remaining slices until all slices have been read.
•   for example, the computing node 210 creates 7 threads G1 to G7 to read the file to be read, and the file has 10 slices. Threads G1 to G7 can first concurrently read slices 1 to 7; after thread G1 finishes slice 1, it takes a slice from the remainder, say slice 8, and continues processing, and the other threads follow the same strategy until all slices are processed. It should be understood that the above examples are only for illustration, and this application does not make specific limitations.
•   the starting position of a slice can include the offset value and line number of the slice's start in the file to be read. Each thread can determine the length of the slice to be read from the line number of that slice's start and the line number of the next slice's start, so some threads can read multiple consecutive slices from the starting position of the current slice according to the lengths of the current slice and the following slice.
•   for example, if the number of threads is 7 and the number of slices is 10, 4 slices can be allocated to threads 1 to 4 for concurrent reading (one slice each), and 6 slices to threads 5 to 7 (two slices each): thread 5 reads from the start of the 5th slice to the start of the 7th slice, thread 6 reads from the start of the 7th slice to the start of the 9th slice, and thread 7 reads from the start of the 9th slice to the end of the file.
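One way to realize the 7-thread/10-slice split above (the first threads take one slice, the later threads take two) is a contiguous assignment. This is a sketch of one possible policy; it assumes the slice count is at most twice the thread count, and the function name is hypothetical:

```python
def assign_slices(num_slices: int, num_threads: int):
    """Assign contiguous ranges of slice indexes (0-based) to threads when
    the thread count is below the slice count. Earlier threads take one
    slice; later threads take two, so every slice is read exactly once.
    Assumes num_threads <= num_slices <= 2 * num_threads."""
    doubles = num_slices - num_threads      # threads that must take 2 slices
    singles = num_threads - doubles         # threads that take 1 slice
    plan, start = [], 0
    for t in range(num_threads):
        count = 1 if t < singles else 2
        plan.append(list(range(start, start + count)))
        start += count
    return plan
```

`assign_slices(10, 7)` gives the last three threads the slice pairs [4, 5], [6, 7], and [8, 9] (0-based), matching the reading ranges of threads 5 to 7 described above.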
•   each row of data is denoted L1 to L9 respectively.
•   the computing node 210 can apply for 3 threads G1 to G3 according to the slice count 3, apply for a memory space n0 that accommodates 9 rows of data from the memory according to the row count 9, and then call the 3 threads to concurrently read the file to be read into the memory space n0.
  • thread G1 reads slice 1
  • thread G2 reads slice 2
  • thread G3 reads slice 3.
•   thread G1 determines that the length of slice 1 is 3 lines according to the line number 1 of slice 1 and the line number 4 of the next slice (slice 2); thread G2 determines that the length of slice 2 is 3 lines according to the line number 4 of slice 2 and the line number 7 of the next slice (slice 3); thread G3 determines that the length of slice 3 is 3 lines according to the line number 7 of slice 3 and the total row count 9.
•   thread G1 sets the read pointer to the offset value w1 and reads 3 lines of data L1 to L3 into the first three lines of the memory space n0; thread G2 sets the read pointer to the offset value w4 and reads 3 lines of data L4 to L6 into lines 4 to 6 of the memory space n0; thread G3 sets the read pointer to the offset value w7 and reads 3 lines of data L7 to L9 into the last three lines of the memory space n0. Threads G1, G2, and G3 process these tasks concurrently, thereby completing one concurrent file read.
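The 9-row/3-slice example above can be sketched with a thread pool. The function name, the plain-text row format, and the use of `ThreadPoolExecutor` are assumptions for illustration; the structure (per-slice byte offset plus start line number, a buffer sized once from the row count, rows written at their final positions) follows the text:

```python
from concurrent.futures import ThreadPoolExecutor

def read_slices_concurrently(path, offsets, start_lines, total_rows):
    """Read a total_rows-line text file into a preallocated buffer, one
    thread per slice. offsets[i] is the byte offset of slice i's first
    line; start_lines[i] is its 1-based line number. Each thread derives
    its slice length from the next slice's start line (or the total row
    count), so no later reordering or buffer expansion is needed."""
    buffer = [None] * total_rows                 # applied once from the row count
    bounds = list(start_lines) + [total_rows + 1]

    def read_slice(i):
        n_lines = bounds[i + 1] - bounds[i]      # slice length in lines
        with open(path, "r") as f:               # per-thread file handle
            f.seek(offsets[i])                   # jump to the slice start
            for k in range(n_lines):
                buffer[bounds[i] - 1 + k] = f.readline().rstrip("\n")

    with ThreadPoolExecutor(max_workers=len(offsets)) as pool:
        list(pool.map(read_slice, range(len(offsets))))
    return buffer
```

Because each thread writes disjoint positions of the shared buffer, no locking is needed, and the result arrives already in file order.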
•   the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file, so that when the computing node 210 reads the file to be read from the storage node 220 it can call multiple threads to read the file concurrently. This not only avoids failed data processing caused by incorrectly initializing the memory-space data structure when the data type cannot be determined, but also avoids the resource waste caused by repeatedly expanding the memory space when the number of lines of the file cannot be determined. Reading the file concurrently greatly improves the speed at which the computing node 210 reads files and further improves the processing efficiency of big data and AI tasks.
  • the above steps S810 to S830 are the general data reading method provided by this application.
•   the metadata format of the file to be read differs for different data types, so the data reading process differs in detail across application scenarios.
•   the following describes in detail, with reference to a specific application scenario, the process by which the aforementioned computing node 210 reads the file to be read according to the metadata, taking as an example the case where the storage node 220 stores the file to be read and its corresponding metadata under the same file name in the same path, the data type of the file is a dense matrix, and the metadata format is as shown in FIG. 5.
  • the process for the computing node 210 to obtain the metadata of the file to be read from the storage node 220 may be as follows:
•   step S1002: search, according to the common identifier, whether metadata corresponding to the file to be read exists in the same path or the designated path; if it exists, execute step S1003, and if it does not exist, execute step S1011. Assuming that the metadata extension is .metadata, the node can search the same path for /pathA/pathB/pathC/.../pathN/dataA.metadata to determine whether the metadata dataA.metadata of the file to be read dataA.exp exists.
•   step S1004: obtain (4) the check mask of the metadata file and verify it. If the check mask is verified successfully, the position is the head of the metadata file, and the node starts to read the metadata file, that is, executes step S1005. If verification of the check mask fails, the position is not the head of the metadata file; the computing node 210 can stop reading the metadata and read the file to be read by other means, that is, execute step S1011.
•   step S1005: obtain (5) the metadata check value and verify it. If the metadata check value is verified successfully, the metadata has not been changed since being stored in the storage node 220; the computing node 210 can read the file to be read according to the content of the metadata, and continues to step S1006. If verification of the metadata check value fails, the metadata may have been changed due to data loss or other reasons; the computing node 210 can stop reading the metadata and execute step S1011.
•   the metadata check value can be generated according to certain rules from information such as the data length when the metadata is stored. When the computing node 210 reads the metadata, it can generate a check value from the current metadata's data length and other information according to the same rules for verification. If this check value equals (5) the metadata check value, the metadata has not changed, and step S1006 can continue; if not, the metadata may have changed due to data loss or other reasons. It should be understood that the above implementation of (5) the metadata check value is only for illustration, and this application does not specifically limit the verification method of the metadata.
•   step S1006: obtain (6) the file check value and verify it. If the file check value is verified successfully, the file to be read has not been changed since being stored, and the node continues to step S1007. If verification of the file check value fails, the file to be read may have been changed after storage due to data loss or other reasons, and the computing node 210 may stop reading the file to be read and return a message that the read has failed, that is, execute step S1012.
•   the computing node 210 may first determine whether the file check value is valid, to handle the case where some storage nodes 220 do not generate a file check value and (6) the file check value field is a meaningless character string. If the file check value is invalid, step S1007 can be executed directly. If the check value is valid, it can be verified: if verification of the file check value succeeds, continue to step S1007; if it fails, the computing node 210 may stop reading the file to be read and return information that the read has failed, that is, execute step S1012.
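Steps S1005 and S1006 leave the checksum rule open ("generated according to certain rules" from the data length and other information). A CRC32 over the stored bytes is one possible stand-in, used here purely for illustration; the function name and the choice of CRC32 are assumptions:

```python
import zlib

def verify_check_value(stored_bytes: bytes, stored_check: int) -> bool:
    """Recompute a check value over the stored bytes with the same rule
    used at write time and compare it with the stored check value; a
    match means the content has not changed since it was stored. CRC32
    is an assumed rule, not one fixed by the text above."""
    return zlib.crc32(stored_bytes) == stored_check
```

The same comparison covers both (5) the metadata check value and (6) the file check value, since each is a recomputed value checked against a stored one.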
•   step S1007: obtain (7) the metadata format version, (8) the file format version, and (9) the data type; for example, the metadata format version is V1, the file format is CSV, and the data type is dense matrix. Determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V1, whose file format is CSV, and whose data type is dense matrix. If it is supported, the computing node 210 may execute step S1008; if it is not supported, step S1011 may be executed.
•   S1008: apply for the memory space for loading the file to be read according to (1) the number of rows, and initialize the data structure of the memory space according to (10) the feature value type.
•   S1009: the computing node 210 obtains (2) the number of slices as x, and creates y threads according to the number of cores the processor currently has and its processing capability, where y is less than or equal to x.
•   thread 1 can read slice 1, thread 2 can read slice 2, and so on, so that multiple threads read multiple slices in parallel, which greatly improves the reading speed of the file to be read and thus the processing efficiency of the entire big data or AI task.
•   if the number of threads is less than the number of slices, for example 8 threads and 16 slices, each thread first processes one slice; after a thread finishes its current slice, it takes another slice from the remainder and continues. For example, after thread 1 finishes slice 1 and slice 9 is still pending, thread 1 can continue with slice 9, and the other threads execute the same strategy until all slices are processed. Specifically, the above process can be implemented through round-robin scheduling, which is not detailed here.
•   alternatively, all slices can be allocated to the threads directly. Taking the same example, with 8 threads and 16 slices whose lengths are l1 to l16, thread 1 is allocated slices 1 and 2: it reads data of length l1 + l2 from the starting position of slice 1, reading slice 1 and slice 2 into the memory space; thread 2 directly reads data of length l3 + l4 from the starting position of slice 3, reading slice 3 and slice 4 into the memory space; and so on, which this application does not specifically limit.
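The direct allocation just described (thread 1 reading l1 + l2 bytes from slice 1's start) amounts to a single seek plus one read of the summed slice lengths. A minimal sketch, with a hypothetical helper name:

```python
def read_combined_slices(f, start_offset: int, slice_lengths) -> bytes:
    """Read several consecutive slices in a single call: seek to the
    first assigned slice's starting position and read the sum of the
    assigned slice lengths, e.g. l1 + l2 bytes from the start of
    slice 1. f is any seekable binary file object."""
    f.seek(start_offset)
    return f.read(sum(slice_lengths))
```

Combining consecutive slices this way trades finer scheduling granularity for fewer seeks and read calls per thread.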
•   S1011: the computing node 210 uses other methods to read the file to be read, such as other data processing methods commonly used in the industry, which are not specifically limited here.
  • S1012 The computing node 210 stops reading the file to be read, and returns information that there is an error in the data of the file to be read and the reading has failed.
•   the foregoing data processing method stores the metadata of the file to be read in the storage node 220 in advance, so that when the computing node 210 reads the file to be read from the storage node 220, the memory space can be effectively initialized according to the metadata and the file can be read concurrently, improving the efficiency of data reading and thus the processing efficiency of entire AI tasks and big data tasks.
•   more information can be added to the metadata to meet functional requirements such as data security and reliability, and the metadata is highly scalable.
•   the following describes in detail the process by which the aforementioned computing node 210 reads the file to be read according to the metadata, taking as an example the case where the storage node 220 stores the metadata at the end of the file to be read in the manner shown in FIG. 7, the data type of the file is a sparse matrix, and the metadata format is as shown in FIG. 6.
  • the process for the computing node 210 to obtain the metadata of the file to be read from the storage node 220 may be as follows:
•   step S1103: reverse-read the content within a certain range of the file tail and determine whether a matching format (that is, the format of the (13) check mask) exists within that range. If it exists, the position is the (13) check mask of the metadata, and step S1104 can be executed. If it does not exist, no metadata has been added to the file, and the computing node 210 can use a general data processing method, that is, execute step S1112.
•   step S1105: obtain (4) the check mask in the metadata and verify it a second time, to further confirm whether the position is the head of the metadata. If the check mask is verified successfully, step S1106 is executed; if verification fails, step S1112 is executed. For details, please refer to the aforementioned step S1004, which is not repeated here.
•   step S1106: obtain (5) the metadata check value and verify it. If the metadata check value is verified successfully, proceed to step S1107; if verification fails, execute step S1112.
•   step S1107: obtain (6) the file check value and verify it. If the file check value is verified successfully, the file to be read has not been changed since being stored, and the node continues to step S1108. If verification fails, the file to be read may have been changed after storage due to data loss or other reasons, and the computing node 210 may stop reading the file to be read and execute step S1113. For details, please refer to the aforementioned step S1012, which is not repeated here.
•   S1108: obtain (7) the metadata format version, (8) the file format version, and (9) the data type; for example, the metadata format version is V2, the file format is CSV, and the data type is sparse matrix. Determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V2, whose file format is CSV, and whose data type is sparse matrix. If it is supported, the computing node 210 can execute step S1109; if it is not supported, step S1112 is executed.
•   S1109: apply for the memory space for storing data values and data column indexes according to (10) the number of values, and apply for the memory space for storing the row data amount according to (1) the number of rows.
•   S1110: the computing node 210 obtains (2) the number of slices as x, and then creates y threads according to the number of cores the processor currently has and its processing capability, where y is less than or equal to x.
•   step S1111: each thread concurrently reads multiple slices of the file to be read into the memory space. For details, please refer to step S1010 of the foregoing content, which is not repeated here.
•   for a file to be read whose data type is a sparse matrix, when the computing node 210 calls multiple threads to read the file concurrently, it can, according to the starting position of each slice's data column index, the starting position of each slice's data values, and the starting position of each slice's row data amount, call multiple threads to concurrently read the data values and data column index of each slice into the first memory space, and call multiple threads to concurrently read the row data amount of each slice into the second memory space, thereby obtaining the file to be read.
•   in some embodiments, the computing node 210 needs to convert the sparse matrix into a dense matrix before loading it into the memory space. Each thread can therefore convert its data into the dense-matrix form according to information in the metadata such as (1) the number of rows, (12) the number of columns, and (10) the number of values, and then write it into the memory space. For details, please refer to the embodiment in FIG. 6, which is not repeated here.
•   without metadata, the computing node 210 needs to read the entire file to be read, first parse out its numbers of rows, columns, and values, and only then convert the sparse matrix into a dense matrix. With the metadata, multiple threads can directly convert the slices into the dense-matrix format and write them into the memory space while reading the slices concurrently, according to the numbers of rows, columns, and values in the metadata, thereby avoiding the process of converting the whole sparse matrix into a dense matrix after all of it has been read, and improving the reading efficiency of files of the sparse matrix data type.
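Converting a slice straight to dense rows as it is read can be sketched as below, assuming a CSR-like per-slice encoding (data values, data column indexes, and a per-row data amount). The encoding details and function name are assumptions for illustration:

```python
def slice_to_dense(values, col_index, row_counts, num_cols):
    """Expand one slice of a sparse file into dense rows. row_counts[r]
    is the row data amount (number of stored values) for row r of the
    slice; col_index[k] gives the column of values[k]. The dense width
    comes from (12) the number of columns in the metadata, so a thread
    can emit its rows without first reading the whole matrix."""
    dense, k = [], 0
    for count in row_counts:
        row = [0] * num_cols          # dense row, width from the column count
        for _ in range(count):
            row[col_index[k]] = values[k]
            k += 1
        dense.append(row)
    return dense
```

Each thread applies this to its own slice and writes the resulting rows at their final positions in the memory space, which is what removes the separate whole-matrix conversion pass.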
•   S1112: the computing node 210 uses other methods to read the file to be read, such as other data processing methods commonly used in the industry, which are not specifically limited here.
  • S1113 The computing node 210 stops reading the file to be read, and returns information that the data of the file to be read has an error and the reading has failed.
•   the foregoing data processing method stores the metadata of the file to be read in the storage node 220 in advance, so that when the computing node 210 reads the file to be read from the storage node 220, the memory space can be effectively initialized according to the metadata, and the memory space for storing the file to be read can be applied for in one pass based on the metadata, avoiding the resource waste caused by repeatedly expanding the memory space. The file can also be read concurrently based on the metadata, improving the efficiency of data reading and thus the processing efficiency of entire AI tasks and big data tasks. Moreover, a file to be read whose data type is a sparse matrix can be converted directly into a dense matrix as it is loaded into memory, improving the reading efficiency of sparse matrices, and more information can be appended to the metadata to adapt to reading more types of data files, which makes the data processing method very widely applicable.
  • FIG. 12 is a schematic structural diagram of a computing node 210 provided by the present application.
  • the computing node 210 is applied to the data processing system 400 shown in FIG. 3, and the computing node 210 includes:
•   the metadata reading unit 211 is configured to obtain the metadata of the file to be read, where the metadata includes the number of slices, the number of rows, and the starting position of each slice in the file to be read;
  • the slice reading unit 212 is configured to call multiple threads according to the starting position of each slice in the file to be read, and concurrently read the data of each slice, wherein the multiple threads are created by the computing node according to the number of slices ;
  • the slice reading unit 212 is further configured to store the data of each slice in the memory space according to the order of the starting position of each slice in the file to be read, where the memory space is obtained by the computing node according to the number of rows.
•   the metadata of the file to be read is generated by the storage node according to the metadata format and the file to be read, after the storage node determines the metadata format according to the data type of the file to be read, where files of different data types have different metadata formats.
•   the metadata of the file to be read is stored in the file to be read, and the end of the file includes the starting position of the metadata within the file. The metadata reading unit 211 is configured to read from the end of the file to be read to obtain the starting position of the metadata in the file, and to read the metadata of the file to be read according to that starting position.
  • the metadata of the file to be read is stored in a designated path of the storage node.
  • the metadata storage location of the file to be read is the same as the storage location of the file to be read.
•   the file to be read and its metadata include a common identifier. The metadata reading unit 211 is configured to obtain the common identifier of the file to be read from the storage node, and to obtain the metadata of the file to be read from the specified path or from the storage location of the file to be read according to that common identifier.
  • the metadata of the file to be read includes verification information.
  • the verification information is used to verify whether the metadata of the file to be read has changed after being stored in the storage node.
  • the slice reading unit 212 is configured to verify, according to the verification information, whether the metadata of the file to be read has changed after being stored in the storage node, before calling multiple threads according to the starting position of each slice in the file to be read and concurrently reading the data of each slice.
  • the slice reading unit 212 is configured, when the metadata of the file to be read has not changed after being stored in the storage node, to call multiple threads according to the starting position of each slice in the file to be read and concurrently read the data of each slice.
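A compact sketch of this verify-then-read gate, using a SHA-256 digest as the verification information (one of the check methods the description mentions); the digest is assumed to have been recorded when the metadata was stored:

```python
import hashlib
import json

def verify_and_load(meta_bytes, stored_digest):
    # stored_digest was recorded when the metadata was written; if the
    # metadata changed after being stored, the digests no longer match.
    if hashlib.sha256(meta_bytes).hexdigest() != stored_digest:
        return None                    # changed: fall back to a generic reader
    return json.loads(meta_bytes)      # unchanged: safe to drive concurrent reads
```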
  • the metadata of the file to be read also includes a data type.
  • the metadata also includes a feature value type.
  • the feature value type is used for the computing node to initialize the data of the memory space.
  • the slice reading unit 212 is used to initialize the data structure of the memory space according to the data type before calling multiple threads according to the starting position of each slice in the file to be read, and reading the data of each slice concurrently.
  • the file to be read includes the data value, data column index, and row data amount.
  • the metadata also includes the number of values, and the number of values is used to apply for the memory space that stores the data values and data column indexes.
  • the slice reading unit 212 is configured to, before calling multiple threads according to the starting position of each slice in the file to be read and concurrently reading each slice, apply for a first memory space for storing the data values and the data column indexes according to the number of values; the slice reading unit is further configured to apply for a second memory space for storing the row data amounts according to the number of rows, and to obtain the memory space for storing the file to be read from the first memory space and the second memory space.
  • the starting position of each slice in the file to be read includes the starting position of the data column index of each slice, the starting position of the data value of each slice, and the starting position of the row data amount of each slice; the slice reading unit 212 is configured to store the data of each slice in the memory space according to the order of the starting position of each slice in the file to be read, by storing the data column index and data value of each slice in the first memory space in the order of their starting positions, and storing the row data amount of each slice in the second memory space in the order of its starting position.
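The allocation scheme described above (a first region, sized by the number of values, for data values plus column indexes; a second region, sized by the number of rows, for row data amounts) can be sketched as follows. The element types chosen here are stand-in assumptions for whatever the metadata's data type actually specifies:

```python
from array import array

def allocate_csr_buffers(num_values, num_rows):
    # First memory space, sized by the number of values: the data values
    # and their data column indexes (8 bytes per element, zero-initialized).
    values = array("d", bytes(8 * num_values))
    col_index = array("q", bytes(8 * num_values))
    # Second memory space, sized by the number of rows: the row data amounts.
    row_counts = array("q", bytes(8 * num_rows))
    return values, col_index, row_counts
```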
  • the computing node 210 in the embodiment of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the computing node 210 may correspondingly execute the methods described in the embodiments of the present application, and the foregoing and other operations and/or functions of each unit in the computing node 210 are respectively for implementing the corresponding processes of the methods in FIG. 1 to FIG. 11; for the sake of brevity, they will not be repeated here.
  • in this application, when a computing node performs data reading, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file, so that the computing node 210 can obtain the metadata from the storage node.
  • the length of the file to be read, the number of slices, and the starting position of each slice in the file to be read can then be determined from that metadata, so that the memory space is applied for in one step and multiple threads read the file concurrently; this avoids both the failure of data processing caused by incorrectly initializing the memory space's data structure when the data type cannot be determined, and the waste of resources caused by repeatedly expanding the memory space when the number of lines in the file to be read cannot be determined. Concurrent reading also greatly improves the speed at which the computing node 210 reads files, further improving the processing efficiency of big data and AI tasks.
  • FIG. 13 is a schematic structural diagram of a server 1300 provided by an embodiment of this application.
  • the server 1300 may be the computing node 210 and the storage node 220 in the embodiment of FIG. 1 to FIG. 11.
  • the server 1300 includes a processor 1310, a communication interface 1320, and a memory 1330.
  • the processor 1310, the communication interface 1320, and the memory 1330 may be connected to each other through an internal bus 1340, and may also communicate through other means such as wireless transmission.
  • the embodiment of the present application takes the connection via the bus 1340 as an example.
  • the bus 1340 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus 1340 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 13, but it does not mean that there is only one bus or one type of bus.
  • the processor 1310 may be constituted by at least one general-purpose processor, such as a CPU, or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip may be ASIC, PLD or a combination thereof.
  • the above-mentioned PLD can be CPLD, FPGA, GAL or any combination thereof.
  • the processor 1310 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1330, which enables the computing node 210 to provide various services.
  • the processor 1310 may be a multi-core processor shown in FIG. 1 or a multi-CPU multi-core processor, which is not specifically limited in this application.
  • the memory 1330 is used to store program codes, whose execution is controlled by the processor 1310, so as to execute the processing steps of the computing node 210 in any of the embodiments in FIG. 1 to FIG. 11 described above.
  • the program code may include one or more software modules, and the one or more software modules may be software units of the computing node 210 provided in the embodiment of FIG. 1, such as a metadata reading unit, a slice reading unit, etc.
  • the metadata reading unit is used to obtain the metadata of the file to be read from the storage node; the slice reading unit is used to create multiple threads according to the number of slices and the processing capacity of the computing node's processor, and to apply for the memory space for storing the file to be read; the slice reading unit is also used to call multiple threads according to the starting position of each slice in the file to be read and concurrently read each slice into the memory space to obtain the file to be read. Specifically, these units can be used to execute steps S810 to S830 and their optional steps in the embodiments of FIG. 8 and FIG. 9, steps S1001 to S1012 and their optional steps in the embodiment of FIG. 10, and steps S1101 to S1113 and their optional steps in the embodiment of FIG. 11, and can also be used to perform the other steps performed by the computing node 210 described in the embodiments of FIG. 1 to FIG. 11, which will not be described in detail here.
  • the memory 1330 is used to store program codes, whose execution is controlled by the processor 1310, so as to execute the processing steps of the storage node 220 in any of the embodiments in FIG. 1 to FIG. 11 described above.
  • the program code may include one or more software modules.
  • the one or more software modules may be software units of the storage node 220 provided in the foregoing embodiments, with which the storage node 220 obtains the metadata of the file to be read according to the file to be read.
  • the metadata of the file to be read includes the number of slices, the number of rows, and the starting position of each slice in the file to be read. Specifically, the modules can be used to perform steps S510 to S520 and their optional steps in the embodiment of FIG. 5, and can also be used to perform the other steps performed by the storage node 220 described in the embodiments of FIG. 1 to FIG. 11.
  • the memory 1330 may include a volatile memory, such as a random access memory (RAM); the memory 1330 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1330 may also include a combination of the above types.
  • the memory also stores program code.
  • when the server 1300 is the computing node 210, the memory may specifically include program code for executing the steps performed by the computing node described in the embodiments of FIG. 1 to FIG. 11.
  • when the server 1300 is the storage node 220, it may specifically include program code for executing the steps performed by the storage node described in the embodiments of FIG. 1 to FIG. 11, and may also store the file to be read and the metadata of the file to be read.
  • the communication interface 1320 may be a wired interface (such as an Ethernet interface), an internal interface (such as a high-speed serial computer expansion bus (peripheral component interconnect express, PCIe) bus interface), or a wireless interface (such as a cellular network interface or a wireless local area network interface), used to communicate with other devices or modules.
  • the server in this embodiment may be a common physical server, for example an ARM server or an x86 server, or it may be a virtual machine implemented on a common physical server in combination with NFV technology, where a virtual machine is a complete, software-simulated computer system with full hardware functions that runs in a completely isolated environment, for example implemented on a cloud computing infrastructure.
  • FIG. 13 is only a possible implementation of the embodiment of the present application.
  • the server 1300 may also include more or fewer components, which is not limited here.
  • Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
  • the server shown in FIG. 13 may also be a computer cluster composed of at least one physical server, which is not specifically limited in this application.
  • FIG. 14 is a storage array 1400 provided by the present application.
  • the storage array 1400 may be the storage node 220 of the foregoing content.
  • the storage array 1400 includes a storage controller 1410 and at least one storage 1420, where the storage controller 1410 and the at least one storage 1420 are connected to each other through a bus 1430.
  • the storage controller 1410 includes one or more general-purpose processors, where a general-purpose processor can be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on.
  • the storage controller 1410 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1420, which enables the storage array 1400 to provide multiple services.
  • the memory 1420 is used to store program codes, whose execution is controlled by the storage controller 1410, so as to execute the processing steps of the storage node 220 in any one of the embodiments in FIG. 1 to FIG. 11 described above.
  • the program code may include one or more software modules.
  • the one or more software modules may be software units of the storage node 220 provided in the foregoing embodiments, with which the storage node 220 obtains the metadata of the file to be read according to the file to be read.
  • the metadata of the file to be read includes the number of slices, the number of rows, and the starting position of each slice in the file to be read. Specifically, the modules can be used to perform steps S510 to S520 and their optional steps in the embodiment of FIG. 5.
  • the memory 1420 is also used to store program data.
  • the program data includes the file to be read and the metadata of the file to be read.
  • FIG. 14 takes the case in which the program code is stored in memory 1 and the program data is stored in memory n as an example for illustration, which is not limited in this application.
  • the memory 1420 may be a non-volatile memory, such as ROM, flash memory, HDD, or SSD memory, and may also include a combination of the foregoing types of memory.
  • the storage array 1400 may be composed of multiple HDDs or multiple SSDs, or the storage array 1400 may be composed of multiple HDDs and ROMs.
  • at least one memory 1420 is combined in different ways with the assistance of the storage controller 1410 to form a memory group, thereby providing higher storage performance than a single memory as well as data backup capabilities.
  • the storage array 1400 shown in FIG. 14 may also be one or more data centers composed of at least one storage array, and the above one or more data centers may be located at the same location or at different locations, which is not specifically limited in this application.
  • FIG. 14 is only a possible implementation of the embodiment of the present application.
  • the storage array 1400 may also include more or fewer components, which is not limited here.
  • Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
  • This application also provides a system including the server 1300 described in FIG. 13 and the storage array 1400 described in FIG. 14; for the sake of brevity, details are not repeated here.
  • the embodiment of the present application also provides a computer-readable storage medium that stores instructions which, when run on a processor, implement the method flows shown in FIG. 1 to FIG. 11.
  • the embodiment of the present application also provides a computer program product.
  • when the computer program product runs on a processor, the method flows shown in FIG. 1 to FIG. 11 can be realized.
  • the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes at least one computer instruction.
  • when the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • Computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (such as by infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage node, such as a server or a data center, that integrates at least one available medium.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a high-density digital video disc (Digital Video Disc, DVD)), or a semiconductor medium.
  • the semiconductor medium may be an SSD.


Abstract

A data processing method and system. The data processing method is applied to the data processing system, and the data processing system comprises a computing node and a storage node. The data processing method comprises the following steps: a computing node acquires metadata of a file to be read (S810); then, according to the starting position of each slice in said file, calls multiple threads and concurrently reads the data of each slice (S820); and finally, according to the order of the starting positions of the slices in said file, stores the data of each slice in a memory space (S830). By means of the method, when a computing node reads a file to be read, a memory space capable of accommodating said file can be applied for in one step according to the metadata of said file, and said file can be read concurrently to improve the efficiency of data reading, thereby improving the efficiency of processing the whole AI or big data task.

Description

Data processing method and system

Technical Field
This application relates to the field of computers, and in particular to a data processing method and system.
Background
With the continuous development of science and technology, the massive amounts of data generated in the era of information explosion have penetrated into every industry and business function, and the fields of big data and artificial intelligence (AI) have developed alongside them, becoming two very popular research directions.
When a computing node performs a big data or AI task, it first needs to load data files from other devices or platforms into the computing node's memory, after which the computing node completes the computation of the big data or AI task based on the data in memory. However, because the amount of data is large and files cannot be read concurrently, the computing node reads files very inefficiently, and the time it takes the computing node to load a data file into memory can even exceed the time it takes to complete the big data or AI task based on that data, seriously affecting the efficiency of big data or AI tasks.
Summary of the Invention
This application provides a data processing method and system that can improve the efficiency with which computing nodes read files.
In a first aspect, a data processing method is provided, applied to a data processing system that includes a computing node and a storage node. The data processing method includes the following steps: the computing node obtains metadata of a file to be read, where the metadata includes the number of rows of the file to be read and the starting position of each slice in the file to be read; then, according to the starting position of each slice in the file to be read recorded in the metadata, the computing node concurrently reads the data of each slice; finally, according to the order of the starting positions of the slices in the file to be read, it stores the data of each slice into a memory space, where the memory space is applied for according to the number of rows in the metadata.
Because the storage node generates the metadata of the file to be read in advance, when the computing node reads the file it can obtain the number of rows of the file and the starting position of each slice in the file from the metadata, thereby applying for the memory space in one step and having multiple threads read the file concurrently. This avoids the waste of resources caused by repeatedly expanding the memory space when the number of rows of the file cannot be determined, and concurrent reading greatly improves the speed at which the computing node reads files, further improving the processing efficiency of big data and AI tasks.
In a possible implementation, the metadata of the file to be read may also include the number of slices; before the computing node concurrently reads the data of each slice according to the starting position of each slice in the file to be read, it can create multiple threads according to the number of slices and then call those threads to read the data of each slice concurrently. Simply put, when the storage node generates the metadata, it can determine the number of slices x according to the computing node's hardware processing capacity; when the computing node reads the metadata, it creates y threads according to that number of slices x and its current processing capacity, and calls the y threads to read the x slices concurrently.
Optionally, the number of threads y may equal the number of slices x. In that case each thread processes one slice, and the y threads can read the file to be read in parallel, reaching an optimal processing state; the speed at which the computing node reads the file is greatly improved, further improving the processing efficiency of big data and AI tasks.
Optionally, the number of threads y may be less than the number of slices x. When fewer threads are created than there are slices, each thread can first process one slice and then, after finishing it, continue to read the next slice from the remaining ones until all slices have been read. Alternatively, some threads may process only one slice while others process multiple slices; a thread that needs to process p slices can read directly from the starting position of its current slice to the starting position of the (p+1)-th slice. In this way, one thread can process multiple slices, so the slices of the file to be read can still be read concurrently when the number of threads is less than the number of slices.
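The y < x case described above is a classic work-queue pattern: each thread reads one slice, then pulls the next remaining slice until none are left. A minimal sketch follows, where read_slice is a hypothetical callable that reads a single slice (in a real reader it would seek to the slice's starting position):

```python
import queue
import threading

def read_slices_with_fewer_threads(read_slice, slice_ids, num_threads):
    # Every slice goes into a shared queue; each of the y threads pulls a
    # slice, reads it, then pulls the next remaining one until none are left.
    todo = queue.Queue()
    for s in slice_ids:
        todo.put(s)

    def worker():
        while True:
            try:
                s = todo.get_nowait()
            except queue.Empty:
                return                    # all slices have been read
            read_slice(s)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```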
The computing node can flexibly choose the number of threads to create according to its current processing capacity. If the number of threads the processor can currently create equals the number of slices, it can call multiple threads to read the multiple slices of the file in parallel, each thread processing exactly one slice, achieving the best processing state and greatly improving the efficiency with which the computing node reads the file. If the number of threads the processor can currently create is lower than the number of slices, the slices can still be read concurrently, with one thread processing multiple slices; this avoids the possibility of a concurrent read failing because the computing node is currently heavily loaded and its processing capacity is reduced, and a reduced thread count does not prevent concurrent file reading, ensuring the feasibility of the solution. In a possible implementation, the metadata of the file to be read is generated by the storage node, according to the metadata format and the file to be read, after the storage node determines the metadata format from the file's data type, where files of different data types have different metadata formats.
The storage node parses the file to be read in advance, determines the metadata format of the file according to its data type, generates the metadata used for reading the file, and then stores that metadata. As a result, when the computing node reads the file, it can effectively initialize the memory data structure according to the file's metadata and read the file concurrently, improving reading efficiency. Moreover, the metadata is highly extensible: it can be further extended and enriched with whatever information the various data types need at read time, giving the solution provided by this application very broad applicability.
In another possible implementation, the metadata of the file to be read is stored in the file itself, and the end of the file includes the starting position of the metadata within the file. Thus, when the computing node obtains the metadata of the file to be read from the storage node, it can obtain the starting position of the metadata from the end of the file and then read the metadata according to that starting position.
Optionally, the metadata of the file to be read may be stored at the tail of the file, with a metadata header offset and a check mask written at the very end of the file, where the check mask is located before the metadata header offset. When the computing node reads the metadata, it can set the read pointer to the end of the file, read a certain range of content backwards, and determine whether a check mask exists within that range; if it does, the node sets the pointer at the check mask, reads the metadata header offset forwards, then sets the read pointer to that metadata header offset and reads forwards to obtain the metadata.
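The tail-scan procedure in this paragraph can be sketched as follows. The mask value and the layout (check mask immediately followed by an 8-byte little-endian metadata header offset) are assumptions for illustration; the real format is whatever the storage node defines:

```python
import struct

CHECK_MASK = b"METAMASK"    # hypothetical mask value; the real one is format-defined

def locate_metadata(path, scan_window=256):
    with open(path, "rb") as f:
        f.seek(0, 2)                     # 2 = os.SEEK_END: point at the file tail
        size = f.tell()
        start = max(0, size - scan_window)
        f.seek(start)
        tail = f.read()                  # a range of content near the end
        pos = tail.rfind(CHECK_MASK)     # search backwards for the check mask
        if pos == -1:
            return None                  # no mask found: not this file format
        off_at = pos + len(CHECK_MASK)
        (meta_start,) = struct.unpack("<Q", tail[off_at:off_at + 8])
        f.seek(meta_start)               # jump to the metadata header offset
        # The metadata region ends where the check mask begins.
        return f.read(start + pos - meta_start)
```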
Storing the metadata of the file to be read inside the file itself lets the computing node obtain the metadata's starting position from the end of the file and then read the metadata, without the storage node having to set aside extra resources to store the metadata separately; this simplifies the storage node's file management and reduces its management burden.
In another possible implementation, the metadata of the file to be read is stored under a designated path on the storage node.
Optionally, the metadata of the file to be read is stored in the same location as the file itself.
In a specific implementation, the file to be read and its metadata share a common identification, and the computing node obtaining the metadata from the storage node includes: the computing node obtains the common identification of the file to be read from the storage node; the computing node then obtains the metadata of the file from the designated path or from the file's storage location according to that common identification.
After the storage node sets a common identification for the file to be read and its corresponding metadata, it stores the metadata under the designated path or in the file's storage location. When the computing node reads the metadata, it can then retrieve it from that designated path or storage location using the common identification, without modifying the file-reading logic, making the approach applicable to more computing nodes.
In another possible implementation, the metadata of the file to be read includes verification information used to check whether the metadata has changed after being stored on the storage node. Before calling multiple threads according to the starting position of each slice in the file and concurrently reading each slice's data, the computing node can use this verification information to verify the metadata, confirming that no data has been lost or corrupted since it was stored, before concurrently reading the file according to the metadata. Specifically, before the computing node calls multiple threads according to the starting position of each slice in the file to be read and concurrently reads each slice's data, the method further includes the following steps: the computing node verifies, according to the verification information, whether the metadata of the file has changed since being stored on the storage node; if it has not changed, the node calls multiple threads according to the starting position of each slice in the file and concurrently reads the data of each slice.
Optionally, the verification information may include a check mask, a metadata check value, a file check value, a metadata format version, and a file format version, among others. The check mask lets the computing node identify the metadata header, so it is usually located at the head of the metadata. The metadata check value lets the computing node determine whether the metadata has changed since being stored in the storage node; a change indicates the metadata may be corrupted or lost, in which case the computing node can fall back to other data processing methods commonly used in the industry to read the file. The file check value lets the computing node determine whether the file itself has changed since being stored; a change indicates the file may be corrupted or lost, in which case the computing node can return a data-processing-failure message. The metadata format version lets the computing node determine whether it supports reading metadata of that format version; if not, it can fall back to other common data processing methods to read the file. The file format version lets the computing node determine whether it supports reading files of that format version; if not, it can likewise fall back to other common data processing methods. It should be understood that the verification information may include more or less content, which is not specifically limited in this application. Moreover, the verification may use any checking method commonly used in the industry, such as a hash check or a SHA-256 check, which is not specifically limited in this application.
Before invoking multiple threads to concurrently read the file according to the metadata, the computing node can first read the verification information in the metadata header to determine whether the metadata has changed since it was stored in the storage node, and only use the metadata to read the file if it has not. This prevents the computing node from reading the file based on incorrect metadata after the metadata has changed, improving the feasibility of the solution provided by this application.
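The verification step above can be sketched as follows. This is a minimal illustration, not the patent's actual on-disk format: it assumes the metadata header is an 8-byte check mask followed by a SHA-256 metadata check value, and the names `pack_metadata`/`verify_metadata` and the mask value are invented for the example.

```python
import hashlib

CHECK_MASK = b"METAMASK"  # assumed 8-byte magic value marking the metadata header


def pack_metadata(body: bytes) -> bytes:
    """Prefix the metadata body with the check mask and a SHA-256 metadata
    check value so a reader can later detect corruption or loss."""
    return CHECK_MASK + hashlib.sha256(body).digest() + body


def verify_metadata(blob: bytes) -> bytes:
    """Return the metadata body only if the stored check value still matches;
    otherwise signal that the metadata changed after being stored."""
    if blob[:8] != CHECK_MASK:
        raise ValueError("check mask not found: not a metadata header")
    stored, body = blob[8:40], blob[40:]
    if hashlib.sha256(body).digest() != stored:
        raise ValueError("metadata changed after being stored in the storage node")
    return body
```

On a verification failure the computing node would fall back to a conventional single-threaded read, as the text describes.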
In another possible implementation, the metadata of the file to be read further includes a data type. When the data type is a dense matrix, the metadata further includes a value type (eigenvalue type), which the computing node uses to initialize the data structure of the memory space. Before the computing node invokes multiple threads according to each slice's start position in the file to concurrently read the data of each slice, the method may further include the following step: the computing node initializes the data structure of the memory space according to the data type. By initializing the in-memory data structure according to the value type in the metadata, the computing node ensures that data processing does not fail because of a wrong in-memory data structure, improving the reading efficiency of the file.
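A sketch of the dense-matrix case: the value type recorded in the metadata fixes the element layout, so the whole buffer can be allocated correctly in one step. The metadata field names and the type-to-typecode mapping are assumptions for illustration only.

```python
from array import array

# Assumed mapping from the value type string in the metadata to an array typecode.
TYPECODES = {"float32": "f", "float64": "d", "int32": "i", "int64": "q"}


def init_dense_buffer(metadata):
    """Allocate the dense-matrix memory space in a single step, with the
    element data structure fixed by the metadata's value type."""
    typecode = TYPECODES[metadata["value_type"]]
    itemsize = array(typecode).itemsize
    return array(typecode, bytes(metadata["rows"] * metadata["cols"] * itemsize))
```

If the buffer were allocated with the wrong element type, every thread's writes would land at the wrong byte offsets, which is exactly the failure mode the metadata is meant to prevent.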
In another possible implementation, when the data type is a sparse matrix, the sparse matrix is stored as three rows of characters, and every data element is saved through these three rows: one row holds each element's data column index, one row holds each element's data value, and one row holds the amount of data in each matrix row. Accordingly, the metadata of the file to be read further includes a value count, which is used to request a first memory space for storing the data values and the data column indices. Before the computing node invokes multiple threads according to each slice's start position in the file to concurrently read each slice, the above method further includes the following steps: the computing node requests the first memory space for storing the data values and data column indices according to the value count, requests a second memory space for storing the per-row data amounts according to the row count, and obtains the memory space from the first memory space and the second memory space.
When the data type of the file to be read is a sparse matrix, the computing node can request memory space according to the value count and row count in the metadata, ensuring that the memory for a sparse-matrix file can be requested in one step without repeatedly expanding the memory space, which avoids wasting resources and improves the reading efficiency of the file.
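The one-shot allocation for the sparse case can be sketched as below, assuming illustrative metadata field names (`value_count`, `row_count`) and a CSR-like layout matching the three-row storage form just described.

```python
from array import array


def alloc_sparse_spaces(metadata):
    """One-shot allocation for a sparse-matrix file: the value count sizes
    the first memory space (data values plus data column indices), and the
    row count sizes the second memory space (per-row data amounts)."""
    nnz, rows = metadata["value_count"], metadata["row_count"]
    values = array("d", bytes(8 * nnz))       # data values
    col_index = array("q", bytes(8 * nnz))    # data column indices
    row_counts = array("q", bytes(8 * rows))  # per-row data amounts
    return values, col_index, row_counts
```

Without the value count in the metadata, the reader would have to grow these buffers repeatedly while parsing, which is the resource waste the text describes.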
In another possible implementation, when the data type is a sparse matrix, each slice's start position in the file to be read includes the start position of the slice's data column indices, the start position of the slice's data values, and the start position of the slice's per-row data amounts. The computing node storing each slice's data into the memory space in the order of the slices' start positions in the file includes: the computing node stores each slice's data column indices and data values into the first memory space in the order of the slices' column-index start positions and data-value start positions, and stores each slice's per-row data amounts into the second memory space in the order of the slices' per-row-data-amount start positions.
When the data type of the file to be read is a sparse matrix, the computing node can read the three rows of the sparse matrix according to each slice's column-index start position, data-value start position, and per-row-data-amount start position, ensuring that sparse-matrix files can also be read concurrently and improving their reading efficiency.
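A sketch of the slice-placement step: because each slice record carries three independent start positions, worker threads can write their portion of values, column indices, and per-row amounts directly into the right region of each pre-allocated buffer with no coordination. The slice record layout (element offsets rather than byte offsets, and the field names) is an assumed simplification.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def fill_sparse_slices(slices, values, col_index, row_counts):
    """Write each slice's three data rows into the matching region of the
    pre-allocated first memory space (values + column indices) and second
    memory space (per-row data amounts), one thread per slice."""
    def write(s):
        v0, c0, r0 = s["val_off"], s["col_off"], s["row_off"]
        values[v0:v0 + len(s["vals"])] = array("d", s["vals"])
        col_index[c0:c0 + len(s["cols"])] = array("q", s["cols"])
        row_counts[r0:r0 + len(s["rows"])] = array("q", s["rows"])

    with ThreadPoolExecutor(max_workers=max(1, len(slices))) as pool:
        list(pool.map(write, slices))
```

Because the regions are disjoint, the concurrent writes need no locking, and the buffers end up in file order regardless of which thread finishes first.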
In a second aspect, another data processing method is provided, applied to a data processing system that includes a computing node and a storage node. The method includes the following steps: the storage node obtains a file to be read and derives the file's metadata from it, where the metadata includes the file's slice count, row count, and each slice's start position in the file. The row count is used by the computing node to request memory space for holding the file, the slice count is used by the computing node to create multiple threads, and each slice's start position is used by the computing node to invoke the multiple threads to concurrently read each slice's data and store it into the memory space in the order of the slices' start positions in the file. Finally, the storage node stores the file's metadata.
Because the storage node generates the metadata of the file in advance, the computing node can determine the file's length, slice count, each slice's start position, and other information from the metadata when reading the file, and can therefore request memory space in one step and read the file concurrently with multiple threads. This not only avoids data-processing failures caused by incorrectly initialized memory data structures when the data type cannot be determined, but also avoids the resource waste of repeatedly expanding memory space when the file's row count cannot be determined. Concurrent reading, in turn, greatly increases the speed at which the computing node reads files, further improving the processing efficiency of big data and AI tasks.
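The reader-side flow above can be sketched end to end: the slice count fixes the thread count, each slice's start position lets a thread read its region independently, and the target buffer is sized once up front. The metadata dict layout here is an assumed example, not the patent's format.

```python
from concurrent.futures import ThreadPoolExecutor


def concurrent_read(path, metadata):
    """Read a file concurrently: one thread per slice, each seeking to its
    slice's start position; results land in file order because all offsets
    are known from the metadata before any read begins."""
    starts = metadata["slice_starts"]
    size = metadata["file_size"]
    ends = starts[1:] + [size]
    buf = bytearray(size)  # the memory space, requested in a single step

    def read_slice(i):
        with open(path, "rb") as f:  # each thread uses its own file handle
            f.seek(starts[i])
            buf[starts[i]:ends[i]] = f.read(ends[i] - starts[i])

    with ThreadPoolExecutor(max_workers=metadata["slice_count"]) as pool:
        list(pool.map(read_slice, range(metadata["slice_count"])))
    return bytes(buf)
```

Note that each thread writes to a disjoint region of `buf`, so no locking is needed and the slices reassemble in their original order.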
In a possible implementation, the specific process by which the storage node obtains the metadata of the file to be read may be as follows: the storage node parses the file and determines its data type; it then determines the metadata format of the file according to the data type, where files of different data types have different metadata formats; finally, it generates the metadata according to the metadata format and the file itself.
The storage node parses the file in advance, determines the metadata format according to the file's data type, generates the metadata used to read the file, and then stores that metadata, so that when the computing node reads the file it can effectively initialize the in-memory data structure according to the metadata and read the file concurrently, improving reading efficiency. Moreover, the metadata is highly extensible: it can be further extended and enriched with whatever information various types of data require at read time, giving the solution provided by this application very broad applicability.
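A sketch of the storage-node side for a simple row-oriented file: one parse derives the row count and slice start positions aligned to row boundaries, and those become the metadata the reader needs. The field names and the row-aligned slicing policy are illustrative assumptions.

```python
def build_metadata(raw: bytes, slice_count: int) -> dict:
    """Parse the file once and emit reader metadata: row count, slice count,
    each slice's start position (aligned to a row start), and file size."""
    line_starts = [0] + [i + 1 for i, b in enumerate(raw) if b == ord("\n")]
    if line_starts and line_starts[-1] == len(raw):
        line_starts.pop()  # trailing newline: no row begins after it
    rows = len(line_starts)
    per = max(1, rows // slice_count)
    starts = [line_starts[i] for i in range(0, rows, per)][:slice_count]
    return {"row_count": rows, "slice_count": len(starts),
            "slice_starts": starts, "file_size": len(raw)}
```

Aligning slice boundaries to row starts means no record ever straddles two slices, so each reading thread can parse its region independently.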
In another possible implementation, the specific steps by which the storage node stores the metadata may be as follows: the storage node stores the metadata of the file to be read inside the file itself, and the end of the file includes the start position of the metadata within the file, so that after obtaining that start position from the end of the file, the computing node can read the metadata according to the metadata's start position in the file.
The metadata of the file to be read can be stored at the tail of the file, with the metadata header offset and a check mask written at the very end, where the check mask precedes the metadata header offset. When reading the metadata, the computing node can set the read pointer to the end of the file, read a bounded range of content backwards, and determine whether a check mask exists within that range. If a check mask exists, the node sets the pointer at the check mask, reads the metadata header offset forwards, then sets the read pointer to that header offset and reads forwards to obtain the metadata.
By storing the metadata inside the file to be read, the computing node can obtain the metadata's start position from the end of the file and then read the metadata, without the storage node having to set aside additional resources to store the metadata separately, which simplifies the storage node's file management and reduces its management burden.
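The tail layout and its reading procedure can be sketched on byte strings (standing in for file I/O). The mask value, the 64-byte backward scan window, and the 8-byte little-endian offset encoding are all assumptions for illustration; a real format would also guard against the mask appearing in payload data.

```python
MASK = b"\xaaMETAMRK"  # assumed 8-byte check mask marking the tail record


def append_metadata(payload: bytes, metadata: bytes) -> bytes:
    """Write the metadata after the payload; the very end of the file holds
    the check mask followed by the metadata header offset."""
    return payload + metadata + MASK + len(payload).to_bytes(8, "little")


def read_tail_metadata(blob: bytes) -> bytes:
    """Scan a bounded tail window backwards for the check mask, read the
    header offset after it, then jump to that offset and read forwards."""
    window = max(0, len(blob) - 64)
    pos = blob.rfind(MASK, window)
    if pos < 0:
        raise ValueError("no check mask found: file carries no embedded metadata")
    offset = int.from_bytes(blob[pos + len(MASK):pos + len(MASK) + 8], "little")
    return blob[offset:pos]
```

A reader that does not understand the tail record simply ignores it and processes the payload as an ordinary file, which is what keeps this layout backward compatible.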
In another possible implementation, the specific steps by which the storage node stores the metadata may be as follows: the storage node stores the metadata of the file to be read under a designated path on the storage node.
In another possible implementation, the specific steps by which the storage node stores the metadata may be as follows: the storage node stores the metadata of the file to be read under the storage location of the file itself.
In another possible implementation, the file to be read and its metadata include a common identifier, which the computing node uses to obtain the metadata from the designated path or from the file's storage location.
After the storage node assigns a common identifier to the file to be read and its corresponding metadata, it stores the metadata under a designated path or under the storage location of the file to be read. In this way, when reading the metadata, the computing node can retrieve it from that designated path or storage location according to the common identifier, without modifying the file-reading logic, so the method is applicable to more computing nodes.
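A small sketch of the common-identifier lookup. The convention chosen here — the identifier is the file's base name and the metadata file carries a `.meta` suffix — is an assumption; the patent only requires that file and metadata share some identifier.

```python
import os


def metadata_path_for(file_path, designated_dir=None):
    """Locate a file's metadata by the common identifier: look under the
    designated path if one is configured, otherwise under the file's own
    storage location, without touching the file-reading logic itself."""
    ident = os.path.basename(file_path)
    folder = designated_dir if designated_dir is not None else os.path.dirname(file_path)
    return os.path.join(folder, ident + ".meta")
```

Because the lookup is a pure naming convention, a computing node that lacks the metadata (or cannot find it) can still open the data file the ordinary way.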
It should be understood that this application provides the two metadata storage approaches above. In a specific implementation, the storage approach can be chosen flexibly according to the application environment, making the data processing method and data processing system provided by this application more widely applicable.
In another possible implementation, the metadata of the file to be read includes verification information, which the computing node uses to verify whether the metadata has changed since it was stored in the storage node.
Optionally, the verification information may include a check mask, a metadata check value, a file check value, a metadata format version, and a file format version, among others. The check mask lets the computing node identify the metadata header, so it is usually located at the head of the metadata. The metadata check value lets the computing node determine whether the metadata has changed since being stored in the storage node; a change indicates the metadata may be corrupted or lost, in which case the computing node can fall back to other data processing methods commonly used in the industry to read the file. The file check value lets the computing node determine whether the file itself has changed since being stored; a change indicates the file may be corrupted or lost, in which case the computing node can return a data-processing-failure message. The metadata format version lets the computing node determine whether it supports reading metadata of that format version; if not, it can fall back to other common data processing methods to read the file. The file format version lets the computing node determine whether it supports reading files of that format version; if not, it can likewise fall back to other common data processing methods. It should be understood that the verification information may include more or less content, which is not specifically limited in this application. Moreover, the verification may use any checking method commonly used in the industry, such as a hash check or a SHA-256 check, which is not specifically limited in this application.
The storage node writes the verification information into the metadata header of the file to be read, so that before invoking multiple threads to concurrently read the file according to the metadata, the computing node can first read the verification information in the metadata header to determine whether the metadata has changed since being stored in the storage node, and only use the metadata to read the file if it has not. This prevents the computing node from reading the file based on incorrect metadata after the metadata has changed, improving the feasibility of the solution provided by this application.
In another possible implementation, the metadata of the file to be read further includes a data type. When the data type is a dense matrix, the metadata further includes a value type (eigenvalue type), which the computing node uses to initialize the data structure of the memory space.
The storage node puts the value type into the dense matrix's metadata, so the computing node can initialize the in-memory data structure according to the value type in the metadata, ensuring that reading the file does not fail because of a wrong in-memory data structure and improving the reading efficiency of the file.
In another possible implementation, when the data type is a sparse matrix, the sparse matrix is stored as three rows of characters, and every data element is saved through these three rows: one row holds each element's data column index, one row holds each element's data value, and one row holds the amount of data in each matrix row. Accordingly, when the data type is a sparse matrix, the file to be read includes data values, data column indices, and per-row data amounts, and the metadata further includes a value count. The value count is used by the computing node to request a first memory space for storing the data values and data column indices, the row count is used by the computing node to request a second memory space for storing the per-row data amounts, and the memory space of the file to be read includes the first memory space and the second memory space.
When the data type of the file to be read is a sparse matrix, the storage node puts the value count into the sparse matrix's metadata, and the computing node can request memory space according to the value count and row count in the metadata, ensuring that the memory for a sparse-matrix file can be requested in one step without repeatedly expanding the memory space, which avoids wasting resources and improves the reading efficiency of the file.
In another possible implementation, when the data type is a sparse matrix, each slice's start position in the file to be read includes the start position of the slice's data column indices, the start position of the slice's data values, and the start position of the slice's per-row data amounts.
When the data type of the file to be read is a sparse matrix, the computing node can read the three rows of the sparse matrix according to each slice's column-index start position, data-value start position, and per-row-data-amount start position, ensuring that sparse-matrix files can also be read concurrently and improving their reading efficiency.
In a third aspect, a computing node is provided, including modules for executing the data processing method in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, a storage node is provided, including modules for executing the data processing method in the second aspect or any possible implementation of the second aspect.
In a fifth aspect, a data processing system is provided, including a computing node and a storage node. The computing node is configured to implement the operation steps of the method described in the first aspect or any possible implementation of the first aspect, and the storage node is configured to implement the operation steps of the method described in the second aspect or any possible implementation of the second aspect.
In a sixth aspect, a computer program product is provided which, when run on a computer, causes the computer to execute the methods described in the above aspects.
In a seventh aspect, a computer-readable storage medium is provided, storing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Description of the drawings
The following briefly introduces the accompanying drawings used in describing the embodiments or the prior art:
FIG. 1 is a schematic architecture diagram of a multi-core processor provided by this application;
FIG. 2 is a schematic architecture diagram of a data processing system provided by this application;
FIG. 3 is a schematic structural diagram of a data processing system provided by this application;
FIG. 4 is a schematic flowchart of the steps of a data processing method provided by this application;
FIG. 5 and FIG. 6 are schematic diagrams of metadata formats provided by this application;
FIG. 7 shows the format of a file to be read that contains metadata, provided by this application;
FIG. 8 is a schematic flowchart of the steps of a data processing method provided by this application;
FIG. 9 is a schematic flowchart of another data processing method provided by this application;
FIG. 10 is a schematic flowchart of another data processing method provided by this application;
FIG. 11 is a schematic flowchart of another data processing method provided by this application;
FIG. 12 is a schematic structural diagram of a computing node provided by this application;
FIG. 13 is a schematic structural diagram of a server provided by this application;
FIG. 14 is a schematic structural diagram of a storage array provided by this application.
Detailed description
To facilitate understanding of the technical solutions of this application, some terms involved in this application are explained first. It should be noted that the terms used in the embodiments of this application are only intended to explain specific embodiments and are not intended to limit this application.
大数据:无法在一定时间范围内用常规软件工具进行捕捉、管理和处理的数据集合。大数据技术的战略意义在于对海量数据进行专业化处理,处理后的数据可以应用于各个行业,包括金融、汽车、餐饮、电信、能源等等,举例来说,利用大数据技术和物联网技术的无人驾驶汽车,利用大数据技术分析客户行为进行商品推荐、利用大数据技术实现信贷风险分析等等。Big data: A collection of data that cannot be captured, managed, and processed with conventional software tools within a certain time frame. The strategic significance of big data technology lies in the professional processing of massive amounts of data. The processed data can be applied to various industries, including finance, automobiles, catering, telecommunications, energy, etc., for example, using big data technology and Internet of Things technology Of unmanned cars, using big data technology to analyze customer behavior for product recommendation, using big data technology to realize credit risk analysis, and so on.
人工智能:利用数字计算机或者数字计算机控制的计算节点模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。人工智能的应用场景十分广泛,比如人脸识别、车辆识别、行人重识别、数据处理应用等等。AI的底层模型是一种实现AI的数学方法集合,可以使用大量的样本对AI模型进行训练来使训练完成的AI模型获得预测的能力,其中,用于训练AI模型的样本可以是从大数据平台获取的样本。Artificial Intelligence: Theories, methods, technologies and application systems that use digital computers or computing nodes controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. The application scenarios of artificial intelligence are very wide, such as face recognition, vehicle recognition, pedestrian re-recognition, data processing applications, and so on. The underlying model of AI is a collection of mathematical methods to achieve AI. A large number of samples can be used to train the AI model to make the trained AI model obtain the ability to predict. Among them, the samples used to train the AI model can be from big data. Samples obtained by the platform.
并发:两个或多个事件在同一段时间内同时发生,在操作系统的任务处理中,并发则是指在一段时间内有多个线程操作相同资源处理相同或不同的任务。需要注意的是,并发包括多个线程在一段时间内同时操作(并行),也包括多个线程在一段时间内分时交替操作。Concurrency: Two or more events occur at the same time in the same period of time. In the task processing of the operating system, concurrency refers to multiple threads operating the same resource to process the same or different tasks in a period of time. It should be noted that concurrency includes multiple threads operating at the same time (parallel) within a period of time, and also includes multiple threads operating alternately in time-sharing within a period of time.
内核(core):处理器的内核又称为处理器的核心,是处理器的重要组成部分。内核可以理解为处理器的可执行单元,处理器所有的计算、接收/存储命令、数据处理等任务都由核心执行。Core: The core of the processor is also called the core of the processor and is an important part of the processor. The kernel can be understood as the executable unit of the processor, and all tasks of the processor, such as calculation, receiving/storing commands, and data processing, are executed by the core.
线程(thread):线程是操作系统能够进行运算调度的最小单位。一个内核至少对应一个线程,通过超线程技术,一个内核还可以对应两个及以上的线程,即同时运行多个线程。Thread: Thread is the smallest unit that the operating system can perform operation scheduling. A core corresponds to at least one thread. Through hyper-threading technology, a core can also correspond to two or more threads, that is, multiple threads are running at the same time.
多核处理器:处理器中可以部署有一个或多个内核。若处理器中部署的内核个数M不小于2,则处理器称为多核处理器。图1是一种多核处理器芯片的结构示意图,其中,图1以M=8为例进行描述,如图1所示,多核处理器100的八个内核分为第一内核101、第二内核102、第三内核103、第四内核104、第五内核105、第六内核106、第七内核107以及第八内核108。其中,第一内核为主内核,负责任务调度(task scheduling),比如根据每个内核适合处 理的任务以及是否空闲等因素,将任务合理分配到其它内核进行处理。多核处理器中还包括用于存储数据的内存109,如双倍速率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)。其中,每个内核与内存以总线(bus)110的方式进行连接,且每个内核可以通过共享内存的方式访问内存中的数据。应理解,并发处理是多核处理器的优势所在,多核处理器可以在特定的时钟周期内调用多个线程并发处理更多的任务。Multi-core processor: One or more cores can be deployed in the processor. If the number M of cores deployed in the processor is not less than 2, the processor is called a multi-core processor. Figure 1 is a schematic diagram of the structure of a multi-core processor chip. Figure 1 takes M=8 as an example for description. As shown in Figure 1, the eight cores of the multi-core processor 100 are divided into a first core 101 and a second core. 102, the third core 103, the fourth core 104, the fifth core 105, the sixth core 106, the seventh core 107, and the eighth core 108. Among them, the first core is the main core and is responsible for task scheduling. For example, according to factors such as the tasks that each core is suitable for processing and whether it is idle, tasks are reasonably allocated to other cores for processing. The multi-core processor also includes a memory 109 for storing data, such as double data rate synchronous dynamic random access memory (DDR SDRAM). Among them, each core and the memory are connected in a bus 110, and each core can access the data in the memory by sharing the memory. It should be understood that concurrent processing is the advantage of the multi-core processor, and the multi-core processor can call multiple threads in a specific clock cycle to concurrently process more tasks.
Multi-CPU multi-core processor: also called a multi-chip multi-core processor, it contains multiple multi-core processor chips as shown in FIG. 1. The multiple multi-core processor chips are connected through an interconnect (interconnect), which can be implemented in a variety of ways, for example as a bus.
The application scenarios involved in this application are further described below with reference to the accompanying drawings.
FIG. 2 is a schematic architectural diagram of a big data or AI task processing system, and may also be referred to as a schematic architectural diagram of a data processing system, in which the computing node implements the file reading process and the storage node implements the file storage process. The system includes a computing node 210, a storage node 220, and a data collection node 230, where the processors on the computing node 210 and the storage node 220 are usually the multi-core processor 100 shown in FIG. 1 or a multi-CPU multi-core processor. The storage node 220, the data collection node 230, and the computing node 210 are connected through a network, which may be a wired network, a wireless network, or a mixture of the two.
The computing node 210 and the storage node 220 may be physical servers, such as X86 servers or ARM servers; they may also be virtual machines (virtual machine, VM) implemented on general-purpose physical servers using network functions virtualization (network functions virtualization, NFV) technology, where a virtual machine is a complete computer system that is simulated by software, has the functions of a complete hardware system, and runs in a completely isolated environment, such as a virtual machine in a cloud data center, which is not specifically limited in this application. The storage node 220 may also be another storage device with a storage function, such as a storage array. It should be understood that the computing node 210 and the storage node 220 may each be a single physical server or a single virtual machine, or may form a computer cluster, which is not specifically limited in this application.
The data collection node 230 may be a hardware device, for example, a physical server or a cluster of physical servers, or may be software, for example, a data collection system or a virtual machine deployed on a server. The data collection system can collect data stored on other servers, for example log information on a web server, and can also collect data gathered by other hardware devices. It should be understood that the above examples are for illustration only and are not specifically limited in this application.
It should be noted that FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of this application, and the positional relationships between the nodes, modules, and the like shown in the figure do not constitute any limitation. For example, the computing node 210, the storage node 220, and the data collection node 230 in FIG. 2 are described as three independent devices or server clusters; in a specific implementation, the computing node 210, the storage node 220, and the data collection node 230 may also be the same server cluster or server, or the computing node 210 and the storage node 220 may be the same server cluster or server, and so on, which is not specifically limited in this application.
In the system shown in FIG. 2, the data collection node 230 collects various raw data and sends it to the storage node 220. The storage node 220 performs data processing on the received raw data, generates a file to be read, and stores it in the storage node 220. It should be understood that, because the sources of the raw data are very diverse and its data structures are very complex, the storage node 220 needs to "translate" the raw data into a unified format that can be directly read and written by the processor before storing it, where the data processing may include data cleaning, feature extraction, format conversion, and so on, which is not specifically limited in this application. The computing node 210 reads the various files to be read from the storage node 220 and loads them into the memory 109 of the computing node 210, and the multi-core processor 100 of the computing node 210 performs the operations of the big data or AI task based on the data in the memory 109. FIG. 2 is described taking as an example the second core 102 completing an AI task and the third core 103 completing a big data task; in a specific implementation, the multi-core processor 100 can process multiple tasks concurrently, and multiple cores can, within a given clock cycle, process the same AI task, the same big data task, or the same data processing task, which is not specifically limited in this application.
For example, suppose the data collection node 230 is a cloud server on which specific services (for example, Kafka and/or Flume) are deployed, where Kafka provides a high-throughput, highly scalable distributed message queue service, and Flume is a highly reliable, highly available, distributed system for collecting, aggregating, and transporting massive volumes of log data. The storage node 220 is a computer cluster on which a Hadoop distributed file system (hadoop distributed file system, HDFS) is deployed, and a data processing system such as Spark may also be deployed on the storage node 220, where Spark is a unified analytics engine for large-scale data processing. The computing node 210 is a computer cluster on which Spark-ML is deployed, where Spark-ML is used to process machine learning (machine learning, ML) tasks.
In the above example, the cloud server on which Kafka and/or Flume are deployed (the data collection node 230) may first produce massive raw data and save it in HDFS (the storage node 220). Spark on the storage node 220 may read the raw data and perform data processing, for example feature extraction and format conversion, converting the raw data into a data format that can be processed by machine learning or big data tasks, generating the file to be read and saving it in HDFS. Finally, Spark-ML (the computing node 210) reads the file to be read from HDFS and loads it into the memory 109, and the multi-core processor 100 performs machine learning tasks based on the data in the memory 109, such as the k-means clustering algorithm (k-means clustering algorithm, K-means) or linear regression (linear regression).
In summary, when performing big data, machine learning, and similar tasks, the computing node 210 needs to first read the file to be read from the storage node 220 and load it into the memory 109 of the computing node 210 (step 1 in FIG. 2), and then the computing node 210 performs the operations of the big data or machine learning task based on the data in the memory 109 (step 2 in FIG. 2).
Next, the data processing system provided by this application is further described with reference to FIG. 3.
This application provides a data processing system 400 as shown in FIG. 3. It should be understood that using the data processing system 400 shown in FIG. 3 to perform data processing in the application scenario shown in FIG. 2 can greatly increase the speed of data processing on the computing node 210, and thereby improve the efficiency with which the computing node 210 processes big data or AI tasks.
As shown in FIG. 3, the data processing system 400 includes a computing node 210 and a storage node 220. For the specific forms and connection modes of the computing node 210 and the storage node 220, reference may be made to the description of FIG. 2, and details are not repeated here.
The storage node 220 includes a metadata generation unit 221, which is configured to generate metadata of the file to be read. The metadata records basic information about the file to be read, including at least the number of rows of the file, the maximum slice count, and the starting position of each slice in the file. For example, the maximum slice count of the file to be read is 3 and its number of rows is 9; the starting position of slice 1 is row 1 of the file, the starting position of slice 2 is row 4, and the starting position of slice 3 is row 7. In a specific implementation, the metadata may also include more information, such as the value type and the number of columns, which may be determined according to the data type of the file to be read and is not specifically limited in this application.
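The basic information described above can be pictured as a small record. The sketch below uses hypothetical field names (the text fixes only the fields, not their representation) and reproduces the 3-slice, 9-row example:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BasicInfo:
    """Basic information recorded in the metadata of a file to be read.
    Field names are illustrative, not taken from the text."""
    row_count: int           # total number of rows in the file to be read
    max_slice_count: int     # maximum slice count
    slice_starts: List[int]  # starting row of each slice (1-based)

# The example from the text: 9 rows, 3 slices starting at rows 1, 4 and 7.
meta = BasicInfo(row_count=9, max_slice_count=3, slice_starts=[1, 4, 7])
```

Note that the record only marks where each slice would begin; the file itself remains unsliced.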
It should be noted that the metadata generation unit 221 only records the maximum slice count of the file to be read and the starting position of each slice in the file; it does not actually slice the file, and the file to be read is stored in the storage node 220 intact, in an unsliced state. Furthermore, the metadata may be stored in the storage node together with the file to be read as a separate file, or may be integrated with the file to be read into a single file in the storage node. The specific storage process of the metadata is described in step S520 of the embodiment of FIG. 4 below.
In a specific implementation, the metadata generation unit 221 may generate the corresponding metadata from the raw data when the storage node 220 receives it; it may generate the corresponding metadata from the processed data after the storage node 220 has performed data processing on the raw data (such as the aforementioned data cleaning, feature extraction, and format conversion) but before the file to be read is generated; or it may generate the corresponding metadata from the file to be read after the storage node 220 has generated it. This application does not limit the input data of the metadata generation unit 221.
The computing node 210 includes a metadata reading unit 211 and a slice reading unit 212. The metadata reading unit 211 is configured to read the metadata of the file to be read. The slice reading unit 212 is configured to determine, from the metadata, the number of rows of the file to be read, the slice count x, and the starting position of each slice in the file; to apply, according to the number of rows, for a region of memory space in which to store the file to be read; and then to send data read requests to y threads (y is an integer less than or equal to x; for example, if the slice count is 3, the number of threads may be 1, 2, or 3, and when y equals x, multiple threads can read the slices of the file in parallel). Each data read request carries the starting position of one slice in the file to be read and the address of the previously requested memory space; for example, the data read request received by thread 1 carries the starting position of slice 1 in the file, the data read request received by thread 2 carries the starting position of slice 2, and the data read request received by thread 3 carries the starting position of slice 3. Finally, in response to the data read requests, the y threads concurrently read the slices of the file to be read according to the starting positions they received, and write the slices they read into the aforementioned memory space in the order of each slice's starting position in the file.
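This read path — one memory allocation sized from the metadata, then one thread per slice seeking to that slice's start and writing into its own region of the shared buffer — can be sketched as follows. The byte-offset bookkeeping and function names are assumptions for illustration, not the claimed implementation:

```python
import threading

def read_slices_concurrently(path, slice_offsets, file_size):
    """Read all slices of a file in parallel into one pre-allocated buffer.

    slice_offsets: byte offset of each slice's start, in ascending order.
    file_size: total file size taken from the metadata, so that the
    buffer is allocated exactly once.
    """
    buffer = bytearray(file_size)               # single memory request
    bounds = list(slice_offsets) + [file_size]  # slice i spans bounds[i]..bounds[i+1]

    def read_one(i):
        start, end = bounds[i], bounds[i + 1]
        with open(path, "rb") as f:             # each thread gets its own handle
            f.seek(start)                       # jump to the slice's start position
            buffer[start:end] = f.read(end - start)

    threads = [threading.Thread(target=read_one, args=(i,))
               for i in range(len(slice_offsets))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(buffer)
```

Because each slice lands at its own offset inside the shared buffer, no reordering step is needed after the threads join: the buffer already holds the slices in file order.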
It should be noted that FIG. 3 is described taking one thread per core as an example (for example, in FIG. 3, core 1 corresponds to thread 1, core 2 to thread 2, and core 3 to thread 3). In a specific implementation, if the multi-core processor or multi-chip multi-core processor of the computing node 210 uses hyper-threading technology, one core may also correspond to multiple threads, for example core 1 to threads 1 and 2 and core 2 to thread 3, or core 1 to threads 1 through 3, and so on, so that multiple cores read the file concurrently, improving resource utilization and data processing efficiency.
Continuing the earlier example, suppose the data collection node 230 is a cloud server on which Kafka and/or Flume are deployed, the storage node 220 is a computer cluster on which HDFS and Spark are deployed, and the computing node 210 is a computer cluster on which Spark-ML is deployed. Then the above metadata generation unit 221 may be deployed in Spark, and the metadata reading unit 211 and the slice reading unit 212 may be deployed in Spark-ML.
In this example, the cloud server on which Kafka and/or Flume are deployed (the data collection node 230) may first produce massive raw data and save it in HDFS (the storage node 220). Spark on the storage node 220 may first read the raw data and perform data processing, for example feature extraction and format conversion; then generate the file to be read and the corresponding metadata from the processed data; and then save the file to be read and the corresponding metadata in HDFS. Finally, when Spark-ML (the computing node 210) reads the file to be read from HDFS, it first reads the metadata of the file, then applies for a contiguous region of memory space according to the information in the metadata, then invokes multiple threads to read the file concurrently and load it into the previously requested memory space, and then performs machine learning tasks based on the data in the memory 109. When reading the file to be read, the computing node 210 not only reads it concurrently, but also avoids the resource waste caused by repeatedly applying for memory and repeatedly copying data, so the efficiency of data processing is greatly improved.
It should be noted that, before reading the metadata, the metadata reading unit 211 determines whether corresponding metadata exists for the file to be read. If the file to be read has no metadata, it may notify the slice reading unit 212 in one thread to read the file to be read according to a data processing method currently available in the industry, which is not limited in this application.
In summary, in the data processing system provided by this application, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file. Thus, when reading the file, the computing node 210 can determine from the metadata information such as the length of the file, the slice count, and the starting position of each slice in the file, so that the memory space is requested once and multiple threads read the file concurrently. This avoids both the incorrect initialization of the in-memory data structure and the resulting data processing failures caused by an undeterminable data type, and the resource waste caused by repeatedly expanding the memory space because the number of rows of the file cannot be determined in advance, while also allowing the file to be read concurrently. The speed at which the computing node 210 reads files is thus greatly improved, further improving the processing efficiency of big data and AI tasks.
The data processing method provided by this application and applicable to the above data processing system 400 is explained below.
As can be seen from the foregoing, before the computing node 210 reads a file, the storage node 220 needs to generate the corresponding metadata from the file to be read and then store the file to be read and the corresponding metadata in the storage node 220. Therefore, the data processing method provided by this application is first described in detail below with reference to FIG. 4.
As shown in FIG. 4, the specific process by which the storage node 220 generates metadata may include the following steps:
S510: Obtain the file to be read from the data collection node 230, and parse the file to be read to obtain the metadata of the file to be read.
It is understandable that if the metadata contains too little information, the computing node 210 may still suffer from low data processing efficiency when reading the file, while if the metadata is too rich, the time the computing node 210 needs to read the metadata increases and the metadata reading efficiency decreases; the information contained in the metadata has a great influence on the efficiency of subsequent data processing. For this reason, this application provides multiple metadata formats to suit various application scenarios. In a specific implementation, after parsing the file to be read, the storage node may first determine the data type of the file, then determine the metadata format of the file according to its data type, where files of different data types have different metadata formats, and finally generate the metadata of the file to be read according to the metadata format and the parsing result.
The formats of the metadata provided by this application are briefly described below.
As can be seen from the foregoing, the metadata records basic information about the file to be read, including at least the number of rows of the file, the maximum slice count, and the starting position of each slice in the file. Exemplarily, the format of the metadata may therefore be as shown in FIG. 5, where the format includes at least basic information 610, and the basic information 610 includes:
(1) Number of rows, identifying the total number of rows contained in the file to be read, used by the computing node 210 to apply for the memory space in which to store the file to be read.
(2) Slice count, identifying the number of slices contained in the file to be read, used by the computing node 210 to apply for multiple threads to read the file to be read concurrently.
It should be noted that the slice count is usually the maximum slice count of the file to be read, and the maximum slice count is an empirical value. It is understandable that if the file to be read has too many slices, the metadata of the file becomes too long, which slows down the computing node 210's reading of the metadata; if the file has too few slices, some cores of the computing node 210 remain idle while the file is being read concurrently, wasting resources. Therefore, the maximum slice count of the file to be read may be determined according to the number of cores of the computing node 210; for example, the maximum slice count may equal the number of processor cores of the computing node 210, or bear a certain proportional relationship to the number of processor cores, which is not specifically limited in this application.
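As a minimal sketch of this sizing rule — the maximum slice count tracking the computing node's core count — the following uses a hypothetical proportionality factor; the text only says the two quantities may be equal or proportional:

```python
import os

def max_slice_count(cores=None, slices_per_core=1):
    """Pick a maximum slice count from the processor core count.

    slices_per_core is an illustrative tuning knob, not a value fixed by
    the text; cores defaults to the local machine's core count.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, cores * slices_per_core)
```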
(3) Starting position of each slice, used by the threads to read the file to be read concurrently. Each thread can, according to the starting position of one slice in the file, read that slice of the file and place it into the previously requested memory space, thereby completing the concurrent reading of the file to be read and improving its reading efficiency.
In a specific implementation, the starting position of each slice may be the offset value and the line number of the slice's starting position in the file to be read. Each thread can determine the length l of its slice from that line number and the line number of the next slice's starting position, then set the read pointer to the offset value and read a slice of length l. Of course, the starting position of each slice may also include more or less content; for example, it may contain only the offset value of each slice's starting position in the file, or it may additionally include the length of each slice, which is not limited in this application.
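A minimal sketch of that per-thread step, assuming each starting position is stored as an (offset, line number) pair: the slice's row count l comes from the next slice's starting line number (or the file's total row count for the last slice), and the thread then seeks to the offset and reads l lines.

```python
def slice_row_count(line_nos, i, total_rows):
    """Rows l in slice i: distance from its starting line number to the
    next slice's starting line number (1-based; last slice runs to EOF)."""
    nxt = line_nos[i + 1] if i + 1 < len(line_nos) else total_rows + 1
    return nxt - line_nos[i]

def read_slice(path, offsets, line_nos, i, total_rows):
    """Set the read pointer to slice i's byte offset and read its l lines."""
    l = slice_row_count(line_nos, i, total_rows)
    with open(path, "r") as f:
        f.seek(offsets[i])
        return [f.readline() for _ in range(l)]
```

With the example metadata of a 9-row file whose slices start at lines 1, 4, and 7, each slice resolves to l = 3 rows.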
In an embodiment, because data in the storage node 220 may be lost or changed — for example, part of the metadata information may be missing, or the data content of the file to be read may have changed — which affects the efficiency with which the computing node 210 reads the file concurrently according to the metadata, the metadata may also include verification information, which is used to improve the reliability of the metadata.
Optionally, as shown in FIG. 5, in addition to the above basic information 610, the metadata may also include verification information 620, where the verification information 620 includes:
(4) Check mask, used by the computing node 210 to confirm that this is the header of the metadata; the check mask is therefore located at the header of the metadata. When the computing node 210 reads the metadata starting from its header, it may first verify the check mask at the header, which is not specifically limited in this application. If the computing node 210 verifies the check mask successfully, this proves that the current position of the read pointer is the header of the metadata; the computing node 210 can then start reading the metadata and, according to the metadata, invoke multiple threads to read the file to be read concurrently. If the verification of the check mask fails, the current position of the pointer is not the header of the metadata; the computing node 210 may then stop using the metadata to read the file, and instead invoke the slice reading unit 212 to read the file according to a data processing method currently available in the industry, which is not limited in this application. In a specific implementation, the check mask may be represented as a binary value to speed up processing.
(5) Metadata check value, used to check whether the content of the metadata information has changed.
(6) File check value, used to check whether the data content of the file to be read has changed.
(7) Metadata format version, recording the format version of the current metadata information; when the computing node reads the metadata, even if it does not support reading metadata information in the latest format, it can remain compatible with files in older versions.
(8) File format version, recording the format information of the current file to be read.
It should be noted that when reading the metadata, the computing node 210 may first read the verification information 620 and, after confirming that the metadata and the data content of the file to be read have not changed and that the version formats are compatible, then read the basic information 610 and invoke multiple threads to read the file concurrently. For this reason, in the metadata format shown in FIG. 5, the verification information 620 precedes the basic information 610. Of course, other means may also be used to ensure that the computing node reads the verification information 620 before the other metadata information, which is not specifically limited in this application.
It should be understood that verification information items (4) to (8) in FIG. 5 are for illustration; the metadata may also include more or fewer kinds of verification information to ensure its reliability, which is not specifically limited here. The methods used to verify items (4) to (6) above may be verification methods commonly used in the industry, such as a hash check or a sha256 check, which are not specifically limited in this application.
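A sketch of what such a check might look like, assuming (as the text permits but does not require) a sha256 file check value and a binary header mask; the constant and the field names are hypothetical:

```python
import hashlib

HEADER_MASK = 0b1010_0101_1010_0101  # illustrative binary check mask

def file_check_value(path):
    """sha256 digest of the file to be read, for comparison against the
    file check value recorded in the metadata."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def metadata_usable(meta, path):
    """Header mask first, then the file check value; on any mismatch the
    computing node falls back to the ordinary, non-concurrent read path."""
    if meta.get("check_mask") != HEADER_MASK:
        return False  # the read pointer is not at a metadata header
    return meta.get("file_check_value") == file_check_value(path)
```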
In an embodiment, the computing node needs different information when reading files of different data types. For example, in the AI field, the data type of a file to be read is usually a dense matrix or a sparse matrix. When the data type of the file is a dense matrix, the computing node 210 needs to initialize the in-memory data structure according to the string type of the values in each column of the dense matrix, to ensure that the file is neither parsed incorrectly nor lost; when the data type of the file is a sparse matrix, the computing node 210 does not need to obtain the value type of each column of the matrix, but instead applies, according to the number of values in the sparse matrix, for memory space in which to separately store the "data values" and the "data column indices". The metadata formats for different types therefore also differ. The dense matrix data type is taken as an example below to describe the metadata format.
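The two allocation strategies can be contrasted in a short sketch; the metadata field names (`value_type`, `value_count`, and so on) are assumptions for illustration:

```python
def init_memory(meta):
    """Pre-allocate the in-memory structure according to the data type
    recorded in the metadata."""
    if meta["data_type"] == "dense_matrix":
        # One typed cell per row and column; the recorded value type
        # decides how each cell is initialized.
        fill = "" if meta["value_type"] == "string" else 0.0
        return [[fill] * meta["col_count"] for _ in range(meta["row_count"])]
    if meta["data_type"] == "sparse_matrix":
        # Separate arrays for the data values and their column indices,
        # both sized by the number of stored values, not rows x columns.
        n = meta["value_count"]
        return {"values": [0.0] * n, "col_index": [0] * n}
    raise ValueError("unsupported data type: %r" % meta["data_type"])
```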
Optionally, as shown in FIG. 5, in addition to the above-mentioned basic information 610 and verification information 620, the metadata may further include type information 630. It should be understood that metadata of different data types has different type information 630. FIG. 5 uses the data type "dense matrix" as an example for description. When the data type is a dense matrix, the type information 630 includes:
(9) Data type, used to describe the name of the data type of the file to be read. FIG. 5 uses the data type "dense matrix" as an example for illustration.
(10) Feature value type, used to describe the type of the feature values of the dense matrix, for example, string. Different types of feature values require memory space with different data structures for storage; therefore, the computing node 210 can initialize the data structure of the memory space according to the type of the feature values of the dense matrix, to ensure that the file to be read is neither parsed incorrectly nor lost.
It is worth noting that because the computing node 210 executes different reading logic when reading files of different data types (for example, a dense matrix requires additional initialization of the data structure of the memory space), the type information 630 in FIG. 5 is located before the basic information 610. In this way, the computing node 210 first verifies the metadata and the file to be read according to the verification information 620, then determines its reading logic according to the type information 630, and finally invokes multiple threads to read the file concurrently according to the basic information 610 and the reading logic. Of course, other means may also be used to ensure the order in which the various metadata items are read, which is not specifically limited in this application.
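To make this ordering concrete, a minimal Python sketch is given below. The dictionary keys, the label strings, and the use of SHA-256 are illustrative assumptions, not part of the described metadata format: verification information is consulted first, and only then does the type information select the reading logic.

```python
import hashlib

def plan_read(meta, data):
    # Step 1: verification information 620 -- if the stored digest no
    # longer matches the file content, the metadata cannot be trusted.
    if meta["checksum"] != hashlib.sha256(data).hexdigest():
        return "fallback-generic-read"
    # Step 2: type information 630 -- choose the reading logic.
    if meta["data_type"] == "dense matrix":
        return "init-column-structures-then-concurrent-read"
    if meta["data_type"] == "sparse matrix":
        return "allocate-value-index-buffers-then-concurrent-read"
    return "fallback-generic-read"

data = b"1,2,3\n4,5,6\n"
meta = {"checksum": hashlib.sha256(data).hexdigest(),
        "data_type": "dense matrix"}
```

If the file content changes after the metadata was generated, the digest check fails and the caller falls back to a conventional read path, matching the behavior described for the verification information.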
It should be understood that files of different data types have different metadata formats, and the content of the type information 630 also differs. For example, as shown in FIG. 6, if item (9) data type of the metadata is "sparse matrix", the type information 630 will not include item (10), but will additionally include:
(11) Value count, used to store the number of values in the sparse matrix; the computing node 210 can request memory space according to the number of values in the sparse matrix. It should be understood that the sparse matrix is stored as three rows of characters in total, and every data item is saved by these three rows: one row of characters represents the "data column index" of each data item, one row represents the "data value" of each data item, and one row represents the "row data amount" of each data item. Therefore, for a sparse matrix, item (1) the number of rows is used to request the first memory space for storing the "row data amounts", and item (11) the value count is used to request the second memory space for storing the "data values" and the "data column indexes".
Moreover, in the basic information 610 of the metadata of a file to be read whose data type is a sparse matrix, item (3) the starting position of each slice is further divided into:
(3.1) The starting position of the data column indexes of each slice;
(3.2) The starting position of the data values of each slice;
(3.3) The starting position of the row data amounts of each slice.
In this way, each thread can read a slice's data column indexes, data values, and corresponding row data amounts according to the starting positions of the slice's three rows of data, and write the slice into the requested memory space in the three-row format of the sparse matrix. Specifically, according to the starting position of the data column indexes of each slice, the starting position of the data values of each slice, and the starting position of the row data amounts of each slice, the computing node 210 can invoke multiple threads to concurrently read the row data amounts of each slice into the first memory space and to concurrently read the data values and data column indexes of each slice into the second memory space, thereby obtaining the file to be read and achieving the purpose of reading multiple slices concurrently with multiple threads.
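As a hedged sketch of this layout (the Python lists, the per-slice offsets, and the field names are illustrative assumptions; real slices would be located by byte offsets inside a file), the three-row sparse representation can be read slice by slice into two preallocated spaces, with each worker writing only its own disjoint region, following the space assignment described for items (1) and (11) above:

```python
# A toy sparse "file" modeled as three parallel sequences.
col_index = [0, 2, 1, 0, 2]   # "data column index" row
values    = [5, 8, 3, 6, 7]   # "data value" row
row_amts  = [2, 1, 2]         # "row data amount" row (one per original row)

meta = {"rows": 3, "values": 5,
        # Hypothetical per-slice starting positions in each sequence,
        # standing in for metadata items (3.1)-(3.3).
        "slices": [{"val_start": 0, "amt_start": 0},
                   {"val_start": 3, "amt_start": 2}]}

first_space  = [0] * meta["rows"]               # row data amounts
second_space = {"values": [0] * meta["values"],
                "col_index": [0] * meta["values"]}

def read_slice(i):
    s = meta["slices"][i]
    nxt = meta["slices"][i + 1] if i + 1 < len(meta["slices"]) else None
    v_end = nxt["val_start"] if nxt else meta["values"]
    a_end = nxt["amt_start"] if nxt else meta["rows"]
    # Each worker touches only its own region, so slices can be read
    # concurrently without coordination.
    second_space["values"][s["val_start"]:v_end] = values[s["val_start"]:v_end]
    second_space["col_index"][s["val_start"]:v_end] = col_index[s["val_start"]:v_end]
    first_space[s["amt_start"]:a_end] = row_amts[s["amt_start"]:a_end]

for i in range(len(meta["slices"])):
    read_slice(i)
```

After all slices are processed, the two memory spaces together hold the complete sparse representation.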
In one embodiment, out of consideration for processor performance, in some application scenarios, when reading a file to be read whose data type is a sparse matrix, the computing node 210 may convert the data from a sparse matrix into a dense matrix before storing it in the memory space. During this conversion, the computing node 210 needs to know in advance the number of columns of the sparse matrix and the original row number of each data item; here, the original row number refers to the row in which the data item was located in the original data before the original data was converted into a sparse matrix and stored in the storage node 220. Therefore, when the data type is a sparse matrix, the type information 630 may further include (12) the number of columns, and item (3.3) the starting position of the row data amounts of each slice includes both the offset of the row data amounts of each slice and the original row numbers. In this way, each thread can read a slice's data column indexes, data values, and corresponding row data amounts according to the starting positions of the slice's three rows of data, and write the slice into the memory space according to the rows and columns of the original data, so that multiple threads can concurrently read multiple slices of the sparse matrix and convert the sparse matrix into a dense matrix when writing it into the memory space.
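A minimal sketch of this conversion follows. The function signature is an assumption: it takes the original row number of each value directly, whereas in the scheme above those row numbers would be recovered from the per-slice row data amounts.

```python
def sparse_to_dense(values, col_index, orig_rows, n_cols):
    # Build a dense matrix using (12) the number of columns from the type
    # information and the original row number attached to each value.
    n_rows = max(orig_rows) + 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for v, c, r in zip(values, col_index, orig_rows):
        dense[r][c] = v
    return dense
```

Positions not covered by any sparse value remain zero, which is exactly the information a sparse matrix omits.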
It should be understood that the metadata formats shown in FIG. 5 and FIG. 6 are used for illustration only. In a specific implementation, the solution provided in this application is applicable not only to the above-mentioned data types (sparse matrices and dense matrices) but also to other data types that can be read item by item or in batches, such as data in the LibSVM format; examples are not given one by one here. Moreover, the metadata of different data types may include more or less content; specifically, the content that the metadata needs to contain can be determined according to the information required by the computing node when reading the file to be read, which is not elaborated here.
S520: Store the metadata and the file to be read.
The storage node 220 stores the metadata in a specified path, or stores the metadata in the storage location of the file to be read, where the file to be read and its metadata contain a common identifier; for example, the file to be read and its metadata have the same file name but different extensions. For example, the storage path of the file to be read (dataA.exp) is /pathA/pathB/…/pathN/dataA.exp, where exp is the general data format of the file to be read, which may specifically be csv, libsvm, and so on. Assuming that the metadata extension is metadata, the storage path of the metadata of the file to be read (dataA.metadata) is /pathA/pathB/…/pathN/dataA.metadata. In this way, when the computing node 210 reads the file to be read, it can directly search the read path of the file to be read for the metadata corresponding to the file by the common identifier. Of course, the storage node 220 may also store the metadata of all files in a specified path; when the computing node 210 reads the file to be read, it can search the specified path for the metadata corresponding to the file according to the common identifier.
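The common-identifier lookup can be sketched as follows. The `.metadata` extension follows the example above, and the helper itself is an illustrative assumption covering both storage variants:

```python
from pathlib import PurePosixPath

def metadata_path(file_path, meta_dir=None):
    # Common identifier: same file name as the file to be read,
    # different extension.
    meta_name = PurePosixPath(file_path).stem + ".metadata"
    if meta_dir is not None:
        # Variant 2: all metadata files live under one specified path.
        return str(PurePosixPath(meta_dir) / meta_name)
    # Variant 1: metadata sits in the storage location of the file itself.
    return str(PurePosixPath(file_path).with_name(meta_name))
```

The computing node would check whether the returned path exists; if it does not, it falls back to a conventional read, as described later in step S810.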
Optionally, the storage node 220 may also store the metadata of the file to be read inside the file to be read itself, with the end of the file containing the starting position of the metadata within the file. In this way, when reading the metadata, the computing node 210 can first read a certain length of data backward from the end of the file to determine the position of the metadata header in the file, specifically the metadata header offset, and then set the read pointer to that metadata header offset for reading, thereby obtaining the metadata of the file to be read.
Exemplarily, after the metadata is appended to the end of the file to be read, the format of the file containing the metadata may be as shown in FIG. 7. Here, assuming that the original file has N rows of data in total, the metadata is appended to the end of the file to be read, and (13) a check mask and (14) a metadata header offset position are also appended to the end of the metadata, where:
(13) Check mask: the check mask is generally located before "(14) metadata header offset position" and is used by the computing node 210 to confirm the beginning of item (14). The computing node 210 can read a certain range of content backward from the end of the file to be read and determine whether the content in that range contains a check mask in the target format; if a check mask in the target format exists, it can then read (14) the metadata header offset position;
(14) Metadata header offset position, used by the computing node 210 to determine the position of the metadata header in the file to be read. In the example shown in FIG. 7, the offset position of the metadata header may be row N+1.
Simply put, when reading the file to be read, the computing node 210 can first set the read pointer to the end of the file, read a certain range of content at the tail of the file backward, and perform pattern matching on it to determine whether the content in that range contains a check mask in the target format. If no check mask in the target format exists, the computing node 210 reads the file to be read using a data processing method commonly used in the industry. If a check mask in the target format exists, the computing node sets the read pointer to the check mask, reads forward to obtain the metadata header offset position, sets the read pointer to that offset position, reads the metadata, and then invokes multiple threads according to the metadata to read the file to be read concurrently.
For example, the check mask may be "#HWBDFORMAT", and the metadata header offset position may be #12345678. When reading the file to be read, the computing node 210 can first set the read pointer to the end of the file, read a certain range of content at the tail of the file backward, and determine whether the content in that range contains the fixed format #HWBDFORMAT. If a check mask in that format exists, the computing node then reads the (14) metadata header offset position that follows the check mask, sets the pointer to the offset position "12345678", and starts reading the metadata.
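This tail-scan can be sketched as below. The `#HWBDFORMAT` mask and the `#<digits>` offset encoding follow the example in the text, but the exact byte layout and the size of the scanned window are assumptions:

```python
MASK = b"#HWBDFORMAT"

def find_metadata_header(blob, tail_window=64):
    # Read a bounded window at the end of the file backward and
    # pattern-match for the check mask; return the metadata header
    # offset, or None so the caller can fall back to a conventional
    # read path.
    tail = blob[-tail_window:]
    pos = tail.rfind(MASK)
    if pos == -1:
        return None
    after = tail[pos + len(MASK):]
    if not after.startswith(b"#"):
        return None
    return int(after[1:].decode())

blob = b"line1\nline2\n...metadata...#HWBDFORMAT#12345678"
```

A file without the mask simply yields `None`, which mirrors the described fallback to an industry-standard read.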
It should be noted that the format of the file to be read containing metadata shown in FIG. 7 is used for illustration only and is not specifically limited in this application.
This application provides the above two metadata storage methods; in a specific implementation, the metadata storage method can be selected according to the application environment. It can be understood that with the method of storing the metadata under the same file name in the storage path of the file to be read, the data processing logic of the computing node needs no modification and is highly reusable, but the file management burden on the storage node 220 increases. With the method of directly appending the metadata to the end of the file to be read, no extra files are generated, which facilitates file management on the storage node 220, but the data processing logic of the computing node must be modified so that the computing node can first read the metadata from the end of the file and then read the file to be read according to the metadata; if the computing node 210 cannot modify its data processing logic, the metadata must be stripped off before the file can be used by the computing node 210. Therefore, in a specific implementation, the metadata storage method can be flexibly determined according to the application environment, so that the data processing method provided in this application can be applied more widely.
It can be understood that in the data processing method provided in this application, the storage node 220 parses the file to be read in advance, determines the metadata format of the file according to its data type, generates the metadata used to read the file, and then stores the metadata, so that when reading the file, the computing node can effectively initialize the in-memory data structures according to the metadata and read the file concurrently, improving file reading efficiency. Moreover, the metadata is highly extensible: it can be further extended and enriched according to the various kinds of information required when reading various types of data, making the solution provided in this application very widely applicable.
The method by which the computing node 210 reads the file to be read is explained below. The data processing method provided in this application can be applied to the computing node 210 of the data processing system 400 described in FIG. 4. As shown in FIG. 8, the method includes the following steps:
S810: The computing node 210 obtains the metadata of the file to be read from the storage node 220, where the metadata of the file to be read includes the number of slices of the file, the number of rows, and the starting position of each slice within the file.
As can be seen from the foregoing, there are two ways to store the metadata; accordingly, there are also two ways for the computing node 210 to obtain the metadata of the file to be read. The two metadata obtaining methods are explained separately below.
In one embodiment, if the storage node 220 stores the metadata of the file to be read in a specified path of the storage node, or in the same storage location as the file to be read, then when the metadata is stored, the file to be read and its metadata include a common identifier; for example, the file to be read and its corresponding metadata have the same file name but different formats. In this case, step S810 may include the following steps: the computing node 210 obtains the common identifier of the file to be read, such as the file name, from the storage node 220, and then, according to the file name, obtains the metadata of the file from the specified path or from the storage location of the file to be read. If the metadata file exists, the computing node reads the metadata file, requests memory space and creates threads according to the metadata file, and invokes the threads to read the file to be read concurrently; if the metadata file does not exist, a data processing method commonly used in the industry is used for data processing, which is not specifically limited in this application.
Continuing with the foregoing example, suppose the storage node 220 has generated the file to be read dataA.exp and the corresponding metadata dataA.metadata, that is, the common identifier between the file and its metadata is the same file name, and then stores the file and the metadata together under /pathA/pathB/…/pathN. When reading dataA.exp, the computing node 210 can search the storage path /pathA/pathB/…/pathN for metadata with the same file name as the file to be read, namely dataA.metadata, or check whether the metadata file exists according to the storage path /pathA/pathB/…/pathN/dataA.metadata. If the metadata file exists, the computing node reads it and reads the file according to the metadata; if the metadata file does not exist, a data processing method commonly used in the industry is used for data processing, which is not specifically limited in this application.
In one embodiment, if the storage node 220 stores the metadata of the file to be read inside the file to be read, for example at the end of the file, step S810 may include the following steps: obtain, from the end of the file to be read, the starting position of the metadata within the file, which may specifically be the offset of the metadata header, and read the metadata according to that metadata header offset.
Still taking the content format shown in FIG. 7 as an example, when reading a file in the format shown in FIG. 7, the computing node 210 can first set the read pointer to the end of the file, read a certain range of content at the tail of the file backward, and perform pattern matching on it to determine whether the content in that range contains a (13) check mask in the target format. If no (13) check mask in the target format exists, the computing node 210 reads the file using a data processing method commonly used in the industry; if a (13) check mask in the target format exists, the computing node then reads the (14) metadata header offset position that follows the (13) check mask, sets the read pointer to that offset position, and reads the metadata.
It should be noted that, regardless of which method is used to obtain the metadata, if the metadata file does not exist, the computing node 210 can use a data processing method commonly used in the industry, parse the file to be read, and return the parsing result to the storage node 220 so that the storage node 220 can generate the metadata of the file according to the parsing result. In this way, when another computing node 210 reads the file, the storage node 220 can return the metadata to that computing node 210 so that it can read the file concurrently according to the metadata.
S820: The computing node invokes multiple threads according to the starting position of each slice within the file to be read, and reads the data of each slice concurrently, where the multiple threads are created by the computing node according to the number of slices.
Optionally, the number of threads y may be equal to the number of slices x. In this case each thread processes one slice, and the y threads can read the file to be read in parallel, achieving an optimal processing state, greatly increasing the speed at which the computing node reads the file, and further improving the processing efficiency of big data and AI tasks.
Optionally, the number of threads y may be less than the number of slices x. As can be seen from the foregoing, the number of slices x of the file to be read is determined according to the hardware processing capability of the computing node 210, and when the computing node 210 reads the file, some of its cores may currently be handling other matters, for example an in-progress big data task or AI task; in this case the number of threads y that the computing node 210 can create may be less than the number of slices x.
For example, suppose the metadata indicates that the maximum number of slices of the file to be read is 10 and the computing node 210 has 10 cores. If all cores of the computing node 210 are currently idle, the computing node 210 can directly create 10 threads and invoke them to read the slices of the file in parallel, achieving the optimal processing state, with the fastest file reading speed and the highest processing efficiency. If 3 cores of the computing node 210 are currently processing a big data task and only 7 cores are idle, the computing node 210 can create 7 threads G1 to G7 and invoke them to read the 10 slices of the file concurrently. It should be understood that the above example is for illustration only and is not specifically limited in this application.
S830: The computing node stores the data of each slice into the memory space in the order of the starting position of each slice within the file to be read, where the memory space is requested by the computing node according to the number of rows.
As can be seen from the foregoing, the starting position of each slice within the file to be read may be the offset and the row number of the slice's starting position in the file. Therefore, after each thread reads the data of a slice, multiple threads can be invoked to write the slices into the memory space concurrently, in the order of the offsets of the slices' starting positions or the order of their row numbers.
In a specific implementation, when the number of threads created is less than the number of slices, each thread can first process one slice and then, after finishing a slice, continue to take the next slice from the remaining slices until all slices have been read. Continuing with the above example, the computing node 210 creates 7 threads G1 to G7 to read the file to be read, while the file has 10 slices; threads G1 to G7 can first read slices 1 to 7 concurrently, and after thread 1 finishes slice 1, it continues to take a slice from the remaining slices to read. For example, if slice 8 is pending, thread 1 continues with slice 8 after finishing slice 1, and the other threads follow the same strategy until all slices are processed. It should be understood that the above example is for illustration only and is not specifically limited in this application.
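The "take the next remaining slice" strategy can be sketched with a shared work queue. The thread and slice counts match the example above; the append to a result list stands in for actually reading a slice and is, like the function itself, an illustrative assumption:

```python
import queue
import threading

def read_slices(slice_ids, num_threads):
    # Fewer threads than slices: every thread keeps pulling the next
    # pending slice off a shared queue until no slices remain.
    work = queue.Queue()
    for s in slice_ids:
        work.put(s)
    done, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                s = work.get_nowait()
            except queue.Empty:
                return
            with lock:
                done.append(s)  # stands in for reading slice s

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)
```

With 7 threads and 10 slices, every slice is read exactly once regardless of which thread happens to pick it up.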
In a specific implementation, when the number of threads created is less than the number of slices, it is also possible for some threads to process only one slice while others process multiple slices, thereby achieving the purpose of processing multiple slices in parallel. As can be seen from the foregoing, the starting position of a slice may include the offset and the row number of each slice's starting position in the file to be read, and each thread can determine the length of the slice to be read according to the row number of the starting position of that slice and the row number of the starting position of the next slice; in this way, some threads can read multiple slices starting from the starting position of the current slice, according to the lengths of the current slice and the following slices. Continuing with the above example, with 7 threads and 10 slices, 4 slices can be allocated for threads 1 to 4 to read concurrently and 6 slices for threads 5 to 7 to read concurrently, where thread 5 reads from the starting position of the 5th slice to the starting position of the 7th slice, thread 6 reads from the starting position of the 7th slice to the starting position of the 9th slice, and thread 7 reads from the starting position of the 9th slice to the end of the file. It should be understood that the above example is for illustration only and is not specifically limited in this application.
For example, as shown in FIG. 9, suppose the file to be read has 9 rows of data in total, denoted L1 to L9 respectively, and suppose the metadata of the file to be read is: (1) number of rows = 9; (2) number of slices = 3; (3) starting position of each slice = offset w1 and row number 1 for slice 1, offset w4 and row number 4 for slice 2, offset w7 and row number 7 for slice 3. Therefore, as shown in FIG. 9, after reading the metadata of the file to be read, the computing node 210 can create 3 threads G1 to G3 according to the slice count of 3, request from the memory 109 a memory space n0 capable of holding 9 rows of data according to the row count of 9, and then invoke the 3 threads to read the file to be read into the memory space n0 concurrently. Thread G1 reads slice 1, thread G2 reads slice 2, and thread G3 reads slice 3. Specifically, thread G1 determines that the length of slice 1 is 3 rows according to the row number 1 of slice 1 and the row number 4 of the next slice (slice 2); thread G2 determines that the length of slice 2 is 3 rows according to the row number 4 of slice 2 and the row number 7 of the next slice (slice 3); and thread G3 determines that the length of slice 3 is 3 rows according to the row number 7 of slice 3 and the total row count of 9. Then thread G1 sets the read pointer to the offset w1 and reads the 3 rows of data L1 to L3 into the first three rows of the memory space n0, thread G2 sets the read pointer to the offset w4 and reads the 3 rows of data L4 to L6 into rows 4 to 6 of the memory space n0, and thread G3 sets the read pointer to the offset w7 and reads the 3 rows of data L7 to L9 into the last three rows of the memory space n0, with threads G1, G2, and G3 processing these tasks concurrently, thereby completing one concurrent read of the file.
In summary, in the data processing method provided by this application, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file. When the computing node 210 reads the file from the storage node 220, it can determine the length of the file, the number of slices, the starting position of each slice in the file, and other information from the metadata, so that the memory space is applied for once and multiple threads read the file concurrently. This not only avoids incorrect initialization of the memory-space data structure and data processing failures caused by an undeterminable data type, but also avoids the resource waste caused by repeatedly expanding the memory space when the number of rows of the file cannot be determined in advance. Because the file can also be read concurrently, the speed at which the computing node 210 reads the file is greatly improved, further improving the processing efficiency of big data and AI tasks.
The above steps S810 to S830 describe the general data reading method provided by this application. As described above, the metadata formats of files to be read differ between data types, so the data reading procedure differs slightly between application scenarios. To make this application better understood, the following describes in detail, with reference to a specific application scenario, the process by which the computing node 210 reads the file to be read according to its metadata, taking as an example the case where the storage node 220 stores the file to be read and the corresponding metadata under the same path with the same file name, the data type of the file to be read is a dense matrix, and the metadata format is as shown in FIG. 5.
As shown in FIG. 10, in this application scenario, the procedure by which the computing node 210 obtains the metadata of the file to be read from the storage node 220 may be as follows:
S1001: Obtain the read path of the file to be read, for example /pathA/pathB/pathC/.../pathN/dataA.exp, where exp is a general data format such as csv or libsvm.
S1002: Check whether the metadata corresponding to the file to be read exists in the same path or a designated path according to the common identifier; if it exists, execute step S1003, and if it does not, execute step S1011. Assuming the metadata extension is metadata, /pathA/pathB/pathC/.../pathN/dataA.metadata can be looked up in the same path to determine whether the metadata dataA.metadata of the file to be read dataA.exp exists.
S1003: Open and load the metadata file.
S1004: Obtain the (4) check mask of the metadata file and verify it. If the check mask is verified successfully, this position is the header of the metadata file and reading of the metadata file can begin, that is, step S1005 is executed. If verification of the check mask fails, this position is not the header of the metadata file; the computing node 210 can stop reading the metadata and read the file to be read by another method, that is, execute step S1011.
S1005: Obtain the (5) metadata check value and verify it. If the metadata check value is verified successfully, the metadata has not changed since it was stored on the storage node 220; the computing node 210 can read the file to be read according to the content of the metadata and continue to step S1006. If verification of the metadata check value fails, the metadata may have changed due to data loss or other reasons; the computing node 210 can stop reading the metadata and execute step S1011.
In a specific implementation, the (5) metadata check value may be generated according to certain rules from information such as the data length of the metadata at the time of storage. In this way, when the computing node 210 reads the metadata, it can generate a verification check value from information such as the data length of the current metadata according to the same rules. If this check value equals the (5) metadata check value, the metadata has not changed, and step S1006 can be continued; if they are not equal, the metadata may have changed due to data loss or other reasons. It should be understood that the above implementation of the (5) metadata check value is only an example, and this application does not specifically limit the verification method of the metadata check value.
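One possible instance of such a rule is sketched below; combining the metadata length with a SHA-256 digest is an illustrative assumption, since the application leaves the exact generation rule open:

```python
import hashlib

def make_metadata_check_value(metadata_bytes):
    """Derive a check value from the metadata content at storage time.
    Here the rule is: byte length plus a SHA-256 digest of the bytes."""
    digest = hashlib.sha256(metadata_bytes).hexdigest()
    return f"{len(metadata_bytes)}:{digest}"

def metadata_unchanged(metadata_bytes, stored_check_value):
    # At read time (step S1005), recompute with the same rule and compare.
    return make_metadata_check_value(metadata_bytes) == stored_check_value
```

Any deterministic rule applied identically at write time and read time would serve the same purpose; a cryptographic digest simply makes silent corruption very unlikely to pass the comparison.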
S1006: Obtain the (6) file check value and verify it. If the file check value is verified successfully, the file to be read has not changed since it was stored; continue to step S1007. If verification of the file check value fails, the file to be read may have changed after storage due to data loss or other reasons; the computing node 210 can stop reading the file and return information that the read failed, that is, execute step S1012.
In a specific implementation, the computing node 210 may first determine whether the file check value is valid, since some storage nodes 220 may not generate a file check value, in which case the (6) file check value field is a meaningless character string. Therefore, if the file check value is invalid, step S1007 can be executed directly; if the check value is valid, it can be verified. If the file check value is verified successfully, continue to step S1007; if verification fails, the computing node 210 can stop reading the file to be read and return information that the read failed, that is, execute step S1012.
S1007: Obtain the (7) metadata format version, (8) file format version, and (9) data type. For example, if the metadata format version is V1, the file format is CSV, and the data type is dense matrix, determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V1, file format is CSV, and data type is dense matrix. If it does, the computing node 210 can execute step S1008; if it does not, execute step S1011.
S1008: Apply for memory space for loading the file to be read according to the (1) number of rows, and initialize the data structure of the memory space according to the (10) feature value type.
S1009: The computing node 210 obtains the (2) number of slices, x, and creates y threads according to the number of cores the processor currently has and the processing capability of the processor, where y is less than or equal to x. Alternatively, the number of threads used for each file read can be preset as y': if y' is not greater than x, y' threads can be requested for data processing, and if y' is greater than x, x threads can be requested.
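The thread-count choice in step S1009 reduces to taking a minimum; the sketch below assumes the core count is used as the default cap, which the step only suggests as one factor:

```python
def choose_thread_count(num_slices, num_cores, preset=None):
    """Pick the number of reader threads y for x slices (step S1009):
    never more threads than slices, and -- absent a preset y' -- no more
    than the cores the processor currently has."""
    if preset is not None:
        return min(preset, num_slices)    # y' if y' <= x, otherwise x
    return min(num_cores, num_slices)
```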
S1010: Each thread concurrently reads slices into the above memory space, taking slices in queue order.
If the number of threads equals the number of slices, thread 1 reads slice 1, thread 2 reads slice 2, and so on, so that multiple threads read multiple slices in parallel, greatly improving the reading efficiency of the file to be read and thus the processing efficiency of the entire big data or AI task.
If the number of threads is less than the number of slices, for example 8 threads and 16 slices, each thread first processes one slice; after a thread finishes its current slice, it takes another slice from the remaining slices and continues processing. For example, after thread 1 finishes slice 1 and slice 9 is still pending, thread 1 can continue with slice 9, and the other threads follow the same strategy until all slices are processed. This process can be implemented with a round-robin scheduling algorithm, which is not described in detail here.
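The take-a-slice-when-free behavior above can be sketched with a shared work queue; this is a simplified variant of the round-robin scheduling the application mentions, and the function and parameter names are illustrative:

```python
import queue
import threading

def process_slices_with_workers(num_threads, slices, handler):
    """Have `num_threads` worker threads drain a shared queue of slices
    (step S1010): each worker takes one slice, processes it with
    `handler`, then takes the next remaining slice, until none are left."""
    pending = queue.Queue()
    for s in slices:
        pending.put(s)

    def worker():
        while True:
            try:
                s = pending.get_nowait()   # take a remaining slice
            except queue.Empty:
                return                     # all slices handled
            handler(s)

    workers = [threading.Thread(target=worker) for _ in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

A strict round-robin assignment would instead give thread i slices i, i + num_threads, i + 2·num_threads, and so on; the queue version additionally balances load when slices take unequal time.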
Alternatively, after the length of each slice is determined from its starting position, all slices can be allocated to all threads directly. Continuing the above example with 8 threads and 16 slices, the lengths l1 to l16 of the slices are determined first; thread 1 is then assigned slices 1 and 2 and reads data of length l1 + l2 from the starting position of slice 1, reading slice 1 and slice 2 into the memory space; thread 2 reads data of length l3 + l4 from the starting position of slice 3, reading slice 3 and slice 4 into the memory space; and so on. This application is not specifically limited in this regard.
S1011: The computing node 210 reads the file to be read by another method, such as other data processing methods commonly used in the industry, which are not specifically limited here.
S1012: The computing node 210 stops reading the file to be read and returns information that the data of the file to be read is erroneous and the read failed.
It can be understood that, in the above data processing method, by storing the metadata of the file to be read on the storage node 220 in advance, the computing node 210 can, when reading the file from the storage node 220, effectively initialize the memory space according to the metadata, avoiding read failures caused by data structure errors; it can apply once, according to the metadata, for a memory space large enough to hold the file to be read, avoiding the resource waste caused by repeatedly expanding the memory space; and it can read the file concurrently according to the metadata, improving the efficiency of data reading and thus the processing efficiency of entire AI and big data tasks. Moreover, further information can be appended to the metadata to meet functional requirements such as data security and reliability, so the scheme is highly extensible.
The following describes the above steps S810 to S830 with reference to another specific application scenario, in which the storage node 220 stores the metadata at the end of the file to be read in the manner shown in FIG. 7, the data type of the file to be read is a sparse matrix, and the metadata format is as shown in FIG. 6. The process by which the computing node 210 reads the file to be read according to this metadata is described in detail.
As shown in FIG. 11, in this application scenario, the procedure by which the computing node 210 obtains the metadata of the file to be read from the storage node 220 may be as follows:
S1101: Open the file to be read.
S1102: After determining the file size, set the current read pointer to the end of the file.
S1103: Read the content within a certain range at the tail of the file in reverse, and determine whether the content within that range contains the matching format (that is, the format of the (13) check mask). If it does, this position is the (13) check mask of the metadata, and step S1104 can be executed. If it does not, the file has no appended metadata, and the computing node 210 can process the data with a general data processing method, that is, execute step S1112.
S1104: Obtain the (14) metadata header offset following the (13) check mask, move the read pointer to that metadata header offset, and start reading the metadata.
S1105: Obtain the (4) check mask in the metadata and verify it a second time to further confirm whether this position is the header position of the metadata. If the check mask is verified successfully, execute step S1106; if verification fails, execute step S1112. For details, refer to the aforementioned step S1004, which is not repeated here.
S1106: Obtain the (5) metadata check value and verify it. If the metadata check value is verified successfully, continue to step S1107; if verification fails, execute step S1112. For details, refer to the aforementioned step S1005, which is not repeated here.
S1107: Obtain the (6) file check value and verify it. If the file check value is verified successfully, the file to be read has not changed since it was stored; continue to step S1108. If verification fails, the file to be read may have changed after storage due to data loss or other reasons; the computing node 210 can stop reading the file and execute step S1113. For details, refer to the aforementioned step S1006, which is not repeated here.
S1108: Obtain the (7) metadata format version, (8) file format version, and (9) data type. For example, if the metadata format version is V2, the file format is CSV, and the data type is sparse matrix, determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V2, file format is CSV, and data type is sparse matrix. If it does, the computing node 210 can execute step S1109; if it does not, execute step S1112.
S1109: Apply for memory space for storing the data values and data column indexes according to the (10) number of values, and apply for memory space for storing the per-row data counts according to the (1) number of rows.
S1110: The computing node 210 obtains the (2) number of slices, x, and creates y threads according to the number of cores the processor currently has and the processing capability of the processor, where y is less than or equal to x. For details, refer to the aforementioned step S1009, which is not repeated here.
S1111: Each thread concurrently reads multiple slices of the file to be read into the memory space. For details, refer to step S1010 above, which is not repeated here.
It should be noted that, for a file to be read whose data type is a sparse matrix, when the computing node 210 calls multiple threads to concurrently read the file, it can, according to the data column index starting position, data value starting position, and per-row data count starting position of each slice, call multiple threads to concurrently read the data values and data column indexes of each slice into the first memory space, and call multiple threads to concurrently read the per-row data counts of each slice into the second memory space, thereby obtaining the file to be read.
Moreover, in consideration of processor performance, in some application scenarios the computing node 210 needs to convert the sparse matrix into a dense matrix before loading it into the memory space. Therefore, each thread can convert the sparse matrix into a dense matrix according to information such as the (1) number of rows, (12) number of columns, and (10) number of values in the metadata, and then write it into the memory space. For details, refer to the embodiment of FIG. 6, which is not repeated here.
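The per-slice sparse-to-dense conversion above can be sketched as follows, assuming a CSR-style layout consistent with the data values, data column indexes, and per-row data counts described for FIG. 6; the function and parameter names are illustrative assumptions:

```python
def slice_to_dense(values, col_indices, row_counts, num_cols):
    """Expand one sparse slice -- its data values, data column indexes,
    and per-row data counts -- directly into dense rows while reading,
    using the column count taken from the metadata."""
    dense = []
    k = 0                                  # cursor into values/col_indices
    for count in row_counts:               # one entry per row in the slice
        row = [0] * num_cols               # dense row sized from metadata
        for _ in range(count):             # scatter this row's nonzeros
            row[col_indices[k]] = values[k]
            k += 1
        dense.append(row)
    return dense

# rows [[0, 5, 0, 0], [7, 0, 0, 9]] stored sparsely:
print(slice_to_dense([5, 7, 9], [1, 0, 3], [1, 2], 4))
```

Because the metadata supplies the row, column, and value counts before any data is read, each thread can allocate and fill its dense rows in a single pass over its slice, which is the point made in the following paragraph.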
It can be understood that, without the method provided by this application, to read a sparse matrix the computing node 210 would need to read the entire file to be read, first parse out information such as its number of rows, number of columns, and number of values, and only then convert the sparse matrix into a dense matrix. With the method provided by this application, multiple threads can, according to the number of rows, number of columns, and number of values in the metadata, write slices into the memory space directly in dense-matrix form while reading them concurrently. This avoids the process of reading the entire sparse matrix before converting it into a dense matrix, and improves the reading efficiency of files whose data type is a sparse matrix.
S1112: The computing node 210 reads the file to be read by another method, such as other data processing methods commonly used in the industry, which are not specifically limited here.
S1113: The computing node 210 stops reading the file to be read and returns information that the data of the file to be read is erroneous and the read failed.
It can be understood that, in the above data processing method, by storing the metadata of the file to be read on the storage node 220 in advance, the computing node 210 can, when reading the file from the storage node 220, first effectively initialize the memory space according to the metadata, avoiding read failures caused by data structure errors; it can apply once, according to the metadata, for the memory space used to hold the file to be read, avoiding the resource waste caused by repeatedly expanding the memory space; it can read the file concurrently according to the metadata, improving the efficiency of data reading and thus the processing efficiency of entire AI and big data tasks; and it can, according to the metadata, directly convert a file to be read whose data type is a sparse matrix into a dense matrix and load it into memory, improving the reading efficiency of sparse matrices. Moreover, further information can be appended to the metadata to adapt to the reading of more types of data files, making this data processing method very widely applicable.
The methods of the embodiments of this application have been described in detail above. To facilitate better implementation of the above solutions of the embodiments of this application, related devices for cooperating in implementing the above solutions are correspondingly provided below.
FIG. 12 is a schematic structural diagram of a computing node 210 provided by this application. The computing node 210 is applied to the data processing system 400 shown in FIG. 3, and the computing node 210 includes:
a metadata reading unit 211, configured to obtain the metadata of the file to be read, where the metadata of the file to be read includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read;
a slice reading unit 212, configured to call multiple threads according to the starting position of each slice in the file to be read and concurrently read the data of each slice, where the multiple threads are created by the computing node according to the number of slices;
the slice reading unit 212 is further configured to store the data of each slice into the memory space in the order of the starting position of each slice in the file to be read, where the memory space is obtained by the computing node by applying according to the number of rows.
Optionally, the metadata of the file to be read is generated by the storage node according to the metadata format and the file to be read, after the storage node determines the metadata format of the file to be read according to the data type of the file to be read, where files of different data types have different metadata formats.
Optionally, the metadata of the file to be read is stored in the file to be read, and the end of the file to be read includes the starting position of the metadata in the file to be read. The metadata reading unit 211 is configured to obtain, from the end of the file to be read, the starting position of the metadata in the file to be read, and to read the metadata of the file to be read according to that starting position.
Optionally, the metadata of the file to be read is stored in a designated path on the storage node.
Optionally, the storage location of the metadata of the file to be read is the same as the storage location of the file to be read.
Optionally, the file to be read and its metadata include a common identifier. The metadata reading unit 211 is configured to obtain the common identifier of the file to be read from the storage node, and to obtain the metadata of the file to be read from the designated path or from the storage location of the file to be read according to the common identifier.
Optionally, the metadata of the file to be read includes verification information used to verify whether the metadata has changed since it was stored on the storage node. The slice reading unit 212 is configured to, before calling multiple threads according to the starting position of each slice in the file to be read to concurrently read the data of each slice, verify according to the verification information whether the metadata of the file to be read has changed since it was stored on the storage node, and to call the multiple threads to concurrently read the data of each slice according to the starting position of each slice in the file to be read only when the metadata has not changed since it was stored on the storage node.
Optionally, the metadata of the file to be read further includes the data type. When the data type is dense matrix, the metadata further includes a feature value type, which is used by the computing node to initialize the data structure of the memory space according to the feature value type. The slice reading unit 212 is configured to initialize the data structure of the memory space according to the data type before calling multiple threads according to the starting position of each slice in the file to be read to concurrently read the data of each slice.
Optionally, when the data type is sparse matrix, the file to be read includes data values, data column indexes, and per-row data counts, and the metadata further includes a number of values used to apply for the first memory space for storing the data values and data column indexes. The slice reading unit 212 is configured to, before calling multiple threads according to the starting position of each slice in the file to be read to concurrently read each slice, apply for the first memory space for storing the data values and data column indexes according to the number of values, apply for the second memory space for storing the per-row data counts according to the number of rows, and obtain the memory space for storing the file to be read from the first memory space and the second memory space.
Optionally, when the data type is sparse matrix, the starting position of each slice in the file to be read includes the data column index starting position, the data value starting position, and the per-row data count starting position of each slice. The slice reading unit 212 is configured to, before storing the data of each slice into the memory space in the order of the starting position of each slice in the file to be read, store the data column indexes and data values of each slice into the first memory space in the order of the data column index starting positions and the data value starting positions of the slices, and store the per-row data counts of each slice into the second memory space in the order of the per-row data count starting positions of the slices.
It should be understood that the computing node 210 of the embodiments of this application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data processing methods shown in FIG. 1 to FIG. 11 are implemented by software, the computing node 210 and its modules may also be software modules.
The computing node 210 according to the embodiments of this application may correspond to performing the methods described in the embodiments of this application, and the above and other operations and/or functions of the units in the computing node 210 are respectively intended to implement the corresponding procedures of the methods in FIG. 1 to FIG. 11. For brevity, details are not repeated here.
综上可知,本申请提供计算节点在进行数据读取时,由存储节点220在计算节点210读取待读取文件之前,提前生成了待读取文件的元数据,使得计算节点210从存储节点220读取待读取文件时,可以根据待读取文件的元数据确定待读取文件的长度、切片数量以及每个切片在待读取文件中的起始位置等信息,从而达到一次性申请内存空间,多个线程并发读取文件的目的,不仅避免了由于无法确定数据类型导致内存空间数据结构初始化有误、数据处理失败的问题,还避免了由于无法确定待读取文件的行数导致多次扩充内存空间造成的资源浪费,又可以并发读取文件,使得计算节点210读取文件的速度得到极大提升,进一步提升大数据和AI任务的处理效率。In summary, in the data reading provided by this application, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file. When reading the file from the storage node 220, the computing node 210 can therefore determine, from the metadata, information such as the length of the file, the number of slices, and the starting position of each slice in the file, so that memory space is applied for once and multiple threads read the file concurrently. This not only avoids incorrect initialization of the memory-space data structure and data processing failures caused by an undeterminable data type, but also avoids the resource waste caused by repeatedly expanding the memory space when the number of rows of the file cannot be determined. Concurrent reading further greatly improves the speed at which the computing node 210 reads files, improving the processing efficiency of big data and AI tasks.
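As an illustrative sketch of this read path, assuming metadata fields named `length` and `slices` (a list of offset/size pairs) — these names are assumptions of the sketch, not the embodiment's actual metadata format:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_with_metadata(path, metadata):
    """Allocate the memory space once from the metadata, then read all
    slices concurrently into it."""
    buf = bytearray(metadata["length"])  # one-shot allocation, never expanded

    def read_slice(slice_info):
        offset, size = slice_info
        with open(path, "rb") as f:  # each thread uses its own file handle
            f.seek(offset)
            buf[offset:offset + size] = f.read(size)

    # Thread count bounded by both the slice count and the processor capacity.
    workers = min(len(metadata["slices"]), os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(read_slice, metadata["slices"]))
    return bytes(buf)
```

Each thread writes into a disjoint region of the pre-allocated buffer, so no locking is needed around the slice copies.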
图13为本申请实施例提供的一种服务器1300的结构示意图。其中,服务器1300可以是图1-图11实施例中的计算节点210以及存储节点220。如图13所示,服务器1300包括:处理器1310、通信接口1320以及存储器1330。其中,处理器1310、通信接口1320以及存储器1330可以通过内部总线1340相互连接,也可通过无线传输等其他手段实现通信。本申请实施例以通过总线1340连接为例,总线1340可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线1340可以分为地址总线、数据总线、控制总线等。为便于表示,图13中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。FIG. 13 is a schematic structural diagram of a server 1300 provided by an embodiment of this application. The server 1300 may be the computing node 210 and the storage node 220 in the embodiment of FIG. 1 to FIG. 11. As shown in FIG. 13, the server 1300 includes a processor 1310, a communication interface 1320, and a memory 1330. Among them, the processor 1310, the communication interface 1320, and the memory 1330 may be connected to each other through an internal bus 1340, and may also communicate through other means such as wireless transmission. The embodiment of the present application takes the connection via the bus 1340 as an example. The bus 1340 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus. The bus 1340 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 13, but it does not mean that there is only one bus or one type of bus.
处理器1310可以由至少一个通用处理器构成,例如CPU,或者CPU和硬件芯片的组合。上述硬件芯片可以是ASIC、PLD或其组合。上述PLD可以是CPLD、FPGA、GAL或其任意组合。处理器1310执行各种类型的数字存储指令,例如存储在存储器1330中的软件或者固件程序,它能使计算节点210提供多种服务。处理器1310可以是图1所示的多核处理器,也可以是多CPU多核处理器,本申请不作具体限定。The processor 1310 may be constituted by at least one general-purpose processor, such as a CPU, or a combination of a CPU and a hardware chip. The above-mentioned hardware chip may be ASIC, PLD or a combination thereof. The above-mentioned PLD can be CPLD, FPGA, GAL or any combination thereof. The processor 1310 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1330, which enables the computing node 210 to provide various services. The processor 1310 may be a multi-core processor shown in FIG. 1 or a multi-CPU multi-core processor, which is not specifically limited in this application.
在服务器1300是计算节点210的情况下,存储器1330用于存储程序代码,并由处理器1310来控制执行,以执行上述图1-图11中任一实施例中计算节点210的处理步骤。程序代码中可以包括一个或多个软件模块,这一个或多个软件模块可以为图1实施例中提供的计算节点210的软件单元,如元数据读取单元、切片读取单元等等,其中,元数据读取单元用于从存储节点获取待读取文件的元数据;切片读取单元用于根据切片数量和计算节点的处理器的处理能力创建多个线程,并根据行数申请用于存放待读取文件的内存空间;切片读取单元还用于根据每个切片在待读取文件中的起始位置,调用多个线程,并发读取每个切片至内存空间,获得待读取文件。具体可用于执行图8和图9实施例中的S810-步骤S830及其可选步骤、图10实施例中的步骤S1001~步骤S1012及其可选步骤、图11实施例中的步骤S1101~步骤S1113及其可选步骤,还可以用于执行图1-图11实施例描述的其他由计算节点210执行的步骤,这里不再进行赘述。In the case where the server 1300 is the computing node 210, the memory 1330 is used to store program code, whose execution is controlled by the processor 1310, so as to execute the processing steps of the computing node 210 in any of the embodiments in FIG. 1 to FIG. 11. The program code may include one or more software modules, and the one or more software modules may be the software units of the computing node 210 provided in the embodiment of FIG. 1, such as the metadata reading unit and the slice reading unit. The metadata reading unit is used to obtain the metadata of the file to be read from the storage node; the slice reading unit is used to create multiple threads according to the number of slices and the processing capacity of the computing node's processor, and to apply, according to the number of rows, for memory space for storing the file to be read; the slice reading unit is further used to call the multiple threads according to the starting position of each slice in the file to be read, to read the slices concurrently into the memory space, and to obtain the file to be read. Specifically, the program code may be used to execute steps S810 to S830 and their optional steps in the embodiments of FIG. 8 and FIG. 9, steps S1001 to S1012 and their optional steps in the embodiment of FIG. 10, and steps S1101 to S1113 and their optional steps in the embodiment of FIG. 11, and may also be used to execute the other steps performed by the computing node 210 described in the embodiments of FIG. 1 to FIG. 11; details are not repeated here.
在服务器1300是存储节点220的情况下,存储器1330用于存储程序代码,并由处理器1310来控制执行,以执行上述图1-图11中任一实施例中存储节点220的处理步骤。程序代码可以包括一个或多个软件模块,这一个或多个软件模块可以为图1实施例中提供的存储节点220的软件单元,如元数据生成单元,其中,元数据生成单元用于存储节点220根据待读取文件,获得待读取文件的元数据,待读取文件的元数据包括待读取文件的切片数量、行数、以及每个切片在待读取文件中的起始位置。具体可用于执行图5实施例中的S510-步骤S520及其可选步骤,还可以用于执行图1-图11实施例描述的其他由存储节点220执行的步骤,这里不再进行赘述。In the case where the server 1300 is the storage node 220, the memory 1330 is used to store program code, whose execution is controlled by the processor 1310, so as to execute the processing steps of the storage node 220 in any of the embodiments in FIG. 1 to FIG. 11. The program code may include one or more software modules, and the one or more software modules may be the software units of the storage node 220 provided in the embodiment of FIG. 1, such as the metadata generation unit. The metadata generation unit is used by the storage node 220 to obtain, from the file to be read, the metadata of the file to be read, where the metadata includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read. Specifically, the program code may be used to execute steps S510 to S520 and their optional steps in the embodiment of FIG. 5, and may also be used to execute the other steps performed by the storage node 220 described in the embodiments of FIG. 1 to FIG. 11; details are not repeated here.
存储器1330可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM);存储器1330也可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM)、快闪存储器(flash memory)、硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);存储器1330还可以包括上述种类的组合。存储器还存储有程序代码,在服务器1300是计算节点210的情况下,具体可以包括用于执行图1-图11实施例描述的由计算节点执行的步骤的程序代码,在服务器1300是存储节点220的情况下,具体可以包括用于执行图1-图11实施例描述的由存储节点执行的步骤的程序代码,并且,存储有待读取文件以及待读取文件的元数据。The memory 1330 may include volatile memory, such as random access memory (RAM); the memory 1330 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1330 may also include a combination of the above types. The memory also stores program code: in the case where the server 1300 is the computing node 210, it may include the program code for executing the steps performed by the computing node described in the embodiments of FIG. 1 to FIG. 11; in the case where the server 1300 is the storage node 220, it may include the program code for executing the steps performed by the storage node described in the embodiments of FIG. 1 to FIG. 11, and additionally stores the file to be read and the metadata of the file to be read.
通信接口1320可以为内部接口(例如高速串行计算机扩展总线(peripheral component interconnect express,PCIe)总线接口)、有线接口(例如以太网接口)或无线接口(例如蜂窝网络接口或使用无线局域网接口),用于与其他设备或模块进行通信。The communication interface 1320 may be an internal interface (for example, a peripheral component interconnect express (PCIe) bus interface), a wired interface (for example, an Ethernet interface), or a wireless interface (for example, a cellular network interface or a wireless local area network interface), and is used to communicate with other devices or modules.
需要说明的是,本实施例可以是通用的物理服务器实现的,例如,ARM服务器或者X86服务器,也可以是基于通用的物理服务器结合NFV技术实现的虚拟机实现的,虚拟机指通过软件模拟的具有完整硬件系统功能的、运行在一个完全隔离环境中的完整计算机系统,比如在本实施例可以在云计算基础设施上实现。It should be noted that this embodiment may be implemented by a general-purpose physical server, for example, an ARM server or an X86 server, or by a virtual machine implemented on a general-purpose physical server combined with NFV technology. A virtual machine is a complete, software-simulated computer system with full hardware system functions that runs in a completely isolated environment; for example, this embodiment may be implemented on a cloud computing infrastructure.
需要说明的,图13仅仅是本申请实施例的一种可能的实现方式,实际应用中,服务器1300还可以包括更多或更少的部件,这里不作限制。关于本申请实施例中未示出或未描述的内容,可参见前述图1-图11实施例中的相关阐述,这里不再赘述。It should be noted that FIG. 13 is only a possible implementation of the embodiment of the present application. In actual applications, the server 1300 may also include more or fewer components, which is not limited here. Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
应理解,图13所示的服务器还可以是至少一个物理服务器构成的计算机集群,本申请不作具体限定。It should be understood that the server shown in FIG. 13 may also be a computer cluster composed of at least one physical server, which is not specifically limited in this application.
图14是本申请提供的一种存储阵列1400,该存储阵列1400可以是前述内容的存储节点220。其中,该存储阵列1400包括存储控制器1410和至少一个存储器1420,其中,存储控制器1410和至少一个存储器1420通过总线1430相互连接。FIG. 14 is a storage array 1400 provided by the present application. The storage array 1400 may be the storage node 220 of the foregoing content. The storage array 1400 includes a storage controller 1410 and at least one storage 1420, where the storage controller 1410 and the at least one storage 1420 are connected to each other through a bus 1430.
存储控制器1410包括一个或者多个通用处理器,其中,通用处理器可以是能够处理电子指令的任何类型的设备,包括CPU、微处理器、微控制器、主处理器、控制器以及ASIC等等。处理器1410执行各种类型的数字存储指令,例如存储在存储器1420中的软件或者固件程序,它能使存储阵列1400提供多种服务。The storage controller 1410 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on. The processor 1410 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1420, which enable the storage array 1400 to provide multiple services.
存储器1420用于存储程序代码,并由存储控制器1410来控制执行,以执行上述图1-图11中任一实施例中存储节点220的处理步骤。程序代码可以包括一个或多个软件模块,这一个或多个软件模块可以为图1实施例中提供的存储节点220的软件单元,如元数据生成单元,其中,元数据生成单元用于存储节点220根据待读取文件,获得待读取文件的元数据,待读取文件的元数据包括待读取文件的切片数量、行数、以及每个切片在待读取文件中的起始位置。具体可用于执行图5实施例中的S510-步骤S520及其可选步骤,还可以用于执行图1-图11实施例描述的其他由存储节点执行的步骤,这里不再进行赘述。存储器1420还用于存储程序数据。其中,程序数据包括待读取文件和待读取文件的元数据,图14以程序代码存储于存储器1、程序数据存储于存储器n为例进行了说明,本申请不对此进行限定。The memory 1420 is used to store program code, whose execution is controlled by the storage controller 1410, so as to execute the processing steps of the storage node 220 in any of the embodiments in FIG. 1 to FIG. 11. The program code may include one or more software modules, and the one or more software modules may be the software units of the storage node 220 provided in the embodiment of FIG. 1, such as the metadata generation unit. The metadata generation unit is used by the storage node 220 to obtain, from the file to be read, the metadata of the file to be read, where the metadata includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read. Specifically, the program code may be used to execute steps S510 to S520 and their optional steps in the embodiment of FIG. 5, and may also be used to execute the other steps performed by the storage node described in the embodiments of FIG. 1 to FIG. 11; details are not repeated here. The memory 1420 is also used to store program data, where the program data includes the file to be read and the metadata of the file to be read. FIG. 14 is described using the example in which the program code is stored in memory 1 and the program data is stored in memory n; this application is not limited thereto.
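A minimal sketch of how such a metadata generation unit might derive the row count and slice start positions for a line-oriented file; the equal-rows slicing policy and the metadata field names are assumptions of this sketch:

```python
def generate_metadata(path, slice_count):
    """Scan the file once, record the byte offset of every line start,
    then choose slice boundaries that fall on line boundaries."""
    line_offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            line_offsets.append(pos)
            pos += len(line)
    rows = len(line_offsets)
    per_slice = max(1, -(-rows // slice_count))  # ceiling division
    slice_starts = [line_offsets[i] for i in range(0, rows, per_slice)]
    return {
        "length": pos,                     # total file length in bytes
        "rows": rows,                      # number of rows in the file
        "slice_count": len(slice_starts),  # number of slices
        "slice_starts": slice_starts,      # starting position of each slice
    }
```

Generating this once on the storage side spares every reader a full pre-scan of the file.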
存储器1420可以是非易失性存储器,例如ROM、快闪存储器、HDD或SSD,存储器1420还可以包括上述种类的存储器的组合。例如,存储阵列1400可以是由多个HDD或者多个SSD组成,或者,存储阵列1400可以是由多个HDD以及ROM组成。其中,至少一个存储器1420在存储控制器1410的协助下按不同的方式组合起来形成存储器组,从而提供比单个存储器更高的存储性能和提供数据备份技术。The memory 1420 may be non-volatile memory, such as ROM, flash memory, HDD, or SSD; the memory 1420 may also include a combination of the above types of memory. For example, the storage array 1400 may be composed of multiple HDDs or multiple SSDs, or of multiple HDDs and ROMs. With the assistance of the storage controller 1410, the at least one memory 1420 can be combined in different ways to form a memory group, thereby providing higher storage performance than a single memory and providing data backup capabilities.
应理解,图14所示的存储阵列1400还可以是至少一个存储阵列构成的一个或者多个数据中心,上述一个或者多个数据中心可以设置在同一个地点,或者,分别在不同的地点,此处不作具体限定。It should be understood that the storage array 1400 shown in FIG. 14 may also be one or more data centers composed of at least one storage array, and the above-mentioned one or more data centers may be located at the same location, or at different locations. There are no specific restrictions.
需要说明的,图14仅仅是本申请实施例的一种可能的实现方式,实际应用中,存储阵列1400还可以包括更多或更少的部件,这里不作限制。关于本申请实施例中未示出或未描述的内容,可参见前述图1-图11实施例中的相关阐述,这里不再赘述。It should be noted that FIG. 14 is only a possible implementation of the embodiment of the present application. In practical applications, the storage array 1400 may also include more or fewer components, which is not limited here. Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
本申请还提供一种包括图13所述服务器1300和图14所述存储阵列1400的系统,该系统用于实现上述图1至图11中所述方法中相应主体的操作步骤,为了避免重复,此处不再赘述。This application further provides a system including the server 1300 described in FIG. 13 and the storage array 1400 described in FIG. 14. The system is used to implement the operation steps of the corresponding entities in the methods described in FIG. 1 to FIG. 11; to avoid repetition, details are not repeated here.
本申请实施例还提供一种计算机可读存储介质,计算机可读存储介质中存储有指令,当其在处理器上运行时,图1-图11所示的方法流程得以实现。The embodiment of the present application also provides a computer-readable storage medium, which stores instructions in the computer-readable storage medium, and when it runs on a processor, the method flow shown in FIG. 1 to FIG. 11 is implemented.
本申请实施例还提供一种计算机程序产品,当计算机程序产品在处理器上运行时,图1-图11所示的方法流程得以实现。The embodiment of the present application also provides a computer program product. When the computer program product runs on a processor, the method flow shown in FIG. 1 to FIG. 11 can be realized.
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括至少一个计算机指令。在计算机上加载或执行计算机程序指令时,全部或部分地产生按照本发明实施例的流程或功能。计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含至少一个可用介质集合的服务器、数据中心等数据存储节点。可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,高密度数字视频光盘(Digital Video Disc,DVD))、或者半导体介质。半导体介质可以是SSD。The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes at least one computer instruction. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage node, such as a server or a data center, that includes at least one set of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a high-density digital video disc (DVD)), or a semiconductor medium. The semiconductor medium may be an SSD.
以上,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present invention, and these modifications or replacements shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (18)

  1. 一种数据处理方法,其特征在于,应用于数据处理系统,所述数据处理系统包括计算节点和存储节点,所述方法包括:A data processing method, characterized in that it is applied to a data processing system, the data processing system includes a computing node and a storage node, and the method includes:
    所述计算节点获取待读取文件的元数据,其中,所述待读取文件的元数据包括所述待读取文件的行数以及每个切片在所述待读取文件中的起始位置;The computing node obtains metadata of the file to be read, where the metadata of the file to be read includes the number of rows of the file to be read and the starting position of each slice in the file to be read;
    所述计算节点根据所述每个切片在所述待读取文件中的起始位置,并发读取所述每个切片的数据;The computing node concurrently reads the data of each slice according to the starting position of each slice in the file to be read;
    所述计算节点按照所述每个切片在所述待读取文件中的起始位置的顺序,将所述每个切片的数据存储至内存空间,其中,所述内存空间是所述计算节点根据所述行数申请得到的。The computing node stores the data of each slice into memory space in the order of the starting positions of the slices in the file to be read, where the memory space is obtained by the computing node through an application made according to the number of rows.
  2. 根据权利要求1所述的方法,其特征在于,所述待读取文件的元数据是所述存储节点根据所述待读取文件的数据类型确定所述待读取文件的元数据格式后,根据所述元数据格式和所述待读取文件生成的,其中,不同的数据类型的待读取文件的元数据格式不同。The method according to claim 1, wherein the metadata of the file to be read is generated by the storage node according to the metadata format and the file to be read, after the storage node determines the metadata format of the file to be read according to the data type of the file to be read, wherein files to be read of different data types have different metadata formats.
  3. 根据权利要求1或2所述的方法,其特征在于,所述待读取文件的元数据存储于所述待读取文件中,所述待读取文件的末尾包括所述元数据在所述待读取文件中的起始位置,所述计算节点获取待读取文件的元数据包括:The method according to claim 1 or 2, wherein the metadata of the file to be read is stored in the file to be read, the end of the file to be read includes the starting position of the metadata in the file to be read, and the computing node obtaining the metadata of the file to be read comprises:
    所述计算节点从所述待读取文件的末尾获得所述元数据在所述待读取文件中的起始位置;The computing node obtains the starting position of the metadata in the file to be read from the end of the file to be read;
    所述计算节点根据所述元数据在所述待读取文件中的起始位置,读取所述待读取文件的元数据。The computing node reads the metadata of the file to be read according to the starting position of the metadata in the file to be read.
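For illustration only, the tail-stored metadata of claim 3 can be sketched as follows; the JSON encoding and the fixed 8-byte footer holding the starting position are assumptions of this sketch, not part of the claim:

```python
import json
import os
import struct

FOOTER = struct.Struct("<Q")  # assumed: 8-byte offset written at the very end

def append_metadata(path, metadata):
    # Append the metadata after the file body, then its starting position.
    with open(path, "ab") as f:
        start = f.tell()
        f.write(json.dumps(metadata).encode("utf-8"))
        f.write(FOOTER.pack(start))

def read_metadata(path):
    with open(path, "rb") as f:
        # Step 1: obtain the metadata's starting position from the file's end.
        f.seek(-FOOTER.size, os.SEEK_END)
        (start,) = FOOTER.unpack(f.read(FOOTER.size))
        # Step 2: read the metadata from that starting position.
        f.seek(start)
        blob = f.read()[:-FOOTER.size]
    return json.loads(blob)
```

Keeping the metadata inside the data file means a reader needs only one extra seek to the tail before starting its concurrent slice reads.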
  4. 根据权利要求1或2所述的方法,其特征在于,所述待读取文件的元数据存储于所述存储节点的指定路径。The method according to claim 1 or 2, wherein the metadata of the file to be read is stored in a designated path of the storage node.
  5. 根据权利要求1或2所述的方法,其特征在于,所述待读取文件的元数据存储位置与所述待读取文件的存储位置相同。The method according to claim 1 or 2, wherein the metadata storage location of the file to be read is the same as the storage location of the file to be read.
  6. 根据权利要求4或5所述的方法,其特征在于,所述待读取文件和所述待读取文件的元数据包括共同标识,所述计算节点获取待读取文件的元数据包括:The method according to claim 4 or 5, wherein the metadata of the file to be read and the file to be read includes a common identifier, and the computing node acquiring the metadata of the file to be read comprises:
    所述计算节点获取所述待读取文件的共同标识;Acquiring, by the computing node, the common identifier of the file to be read;
    所述计算节点根据所述待读取文件的共同标识,从所述指定路径或者所述待读取文件的存储位置获取所述待读取文件的元数据。The computing node obtains the metadata of the file to be read from the designated path or the storage location of the file to be read according to the common identifier of the file to be read.
  7. 根据权利要求1至6任一权利要求所述的方法,其特征在于,所述待读取文件的元数据包括校验信息,所述校验信息用于校验所述待读取文件的元数据存储至所述存储节点之后是否发生过变化,所述计算节点根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片的数据之前,所述方法还包括:The method according to any one of claims 1 to 6, wherein the metadata of the file to be read includes verification information, and the verification information is used to verify whether the metadata of the file to be read has changed after being stored in the storage node; before the computing node calls multiple threads according to the starting position of each slice in the file to be read and concurrently reads the data of each slice, the method further comprises:
    所述计算节点根据所述校验信息校验所述待读取文件的元数据在存储至所述存储节点后是否发生过变化;The computing node verifies, according to the verification information, whether the metadata of the file to be read has changed after being stored in the storage node;
    所述计算节点在所述待读取文件的元数据存储至所述存储节点之后未发生过变化的情况下,根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片的数据。In the case where the metadata of the file to be read has not changed after being stored in the storage node, the computing node calls multiple threads according to the starting position of each slice in the file to be read, and concurrently reads the data of each slice.
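For illustration only, the verification information of claim 7 can be sketched as a digest over the metadata body; the hash algorithm and the JSON serialization are assumptions of this sketch, as the claim fixes neither:

```python
import hashlib
import json

def attach_verification(metadata):
    """Store the metadata together with a digest computed over its body."""
    body = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return {"metadata": metadata, "checksum": hashlib.sha256(body).hexdigest()}

def metadata_unchanged(stored):
    """Recompute the digest and compare; False means the stored metadata
    changed after it was written, so the concurrent read must not proceed."""
    body = json.dumps(stored["metadata"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(body).hexdigest() == stored["checksum"]
```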
  8. 根据权利要求1至7任一权利要求所述的方法,其特征在于,所述待读取文件的元数据还包括数据类型,在所述数据类型是稠密矩阵的情况下,所述元数据还包括特征值类型,所述特征值类型用于供所述计算节点初始化所述内存空间的数据结构;The method according to any one of claims 1 to 7, wherein the metadata of the file to be read further includes a data type; in the case where the data type is a dense matrix, the metadata further includes a feature value type, and the feature value type is used by the computing node to initialize the data structure of the memory space;
    所述计算节点根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片的数据之前,所述方法还包括:Before the computing node invokes multiple threads according to the starting position of each slice in the file to be read, and concurrently reads the data of each slice, the method further includes:
    所述计算节点根据所述数据类型初始化所述内存空间的数据结构。The computing node initializes the data structure of the memory space according to the data type.
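For illustration only, initializing the memory-space data structure from the data type and feature value type of claim 8 can be sketched as follows; the type-code mapping and type names are assumptions of this sketch:

```python
import array

# Hypothetical mapping from feature value types to array type codes.
TYPE_CODES = {"int32": "i", "int64": "q", "float32": "f", "float64": "d"}

def init_dense_memory(rows, cols, value_type):
    """Choose the element layout from the feature value type before any
    slice data arrives, so the structure never needs re-initialization."""
    if value_type not in TYPE_CODES:
        raise ValueError("unsupported feature value type: %s" % value_type)
    code = TYPE_CODES[value_type]
    itemsize = array.array(code).itemsize
    return array.array(code, bytes(rows * cols * itemsize))  # zero-filled
```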
  9. 根据权利要求1至8任一权利要求所述的方法,其特征在于,所述数据类型是稀疏矩阵的情况下,所述待读取文件包括数据值、数据列索引以及行数据量,所述待读取文件的元数据还包括值数量,所述值数量用于申请用于存放所述数据值以及所述数据列索引的第一内存空间,The method according to any one of claims 1 to 8, wherein when the data type is a sparse matrix, the file to be read includes a data value, a data column index, and a row data amount, and the The metadata of the file to be read also includes a value quantity, and the value quantity is used to apply for the first memory space for storing the data value and the data column index,
    所述计算节点根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片之前,所述方法还包括:Before the computing node calls multiple threads according to the starting position of each slice in the file to be read, and before concurrently reading each slice, the method further includes:
    所述计算节点根据所述值数量申请用于存放所述数据值以及所述数据列索引的第一内存空间;The computing node applies for the first memory space for storing the data value and the data column index according to the number of values;
    所述计算节点根据所述行数申请用于存放所述行数据量的第二内存空间,根据所述第一内存空间和第二内存空间获得所述内存空间。The computing node applies for a second memory space for storing the amount of row data according to the number of rows, and obtains the memory space according to the first memory space and the second memory space.
  10. 根据权利要求9所述的方法,其特征在于,所述数据类型是稀疏矩阵的情况下,所述每个切片在所述待读取文件中的起始位置包括所述每个切片的数据列索引起始位置、所述每个切片的数据值起始位置以及所述每个切片的行数据量起始位置;The method according to claim 9, wherein in the case where the data type is a sparse matrix, the starting position of each slice in the file to be read includes the starting position of each slice's data column index, the starting position of each slice's data values, and the starting position of each slice's row data amount;
    所述计算节点按照所述每个切片在所述待读取文件中的起始位置的顺序,将所述每个切片的数据存储至内存空间包括:The computing node storing the data of each slice in the memory space in the order of the starting position of each slice in the file to be read includes:
    所述计算节点根据所述每个切片的数据列索引起始位置的顺序以及所述每个切片的数据值的起始位置的顺序,将所述每个切片的数据列索引以及数据值存储至所述第一内存空间,根据所述每个切片的行数据量的起始位置的顺序,将所述每个切片的行数据量存储至所述第二内存空间。The computing node stores each slice's data column index and data values into the first memory space in the order of the starting positions of the slices' data column indexes and the order of the starting positions of the slices' data values, and stores each slice's row data amount into the second memory space in the order of the starting positions of the slices' row data amounts.
  11. 根据权利要求1至10中任一权利要求所述方法,其特征在于,所述元数据还包括所述待读取文件的切片数量,所述计算节点根据所述每个切片在所述待读取文件中的起始位置,并发读取所述每个切片的数据,包括:The method according to any one of claims 1 to 10, wherein the metadata further includes the number of slices of the file to be read, and the computing node concurrently reading the data of each slice according to the starting position of each slice in the file to be read comprises:
    所述计算节点调用多个线程并发读取所述每个切片的数据,所述多个线程的数量小于或等于所述切片数量。The computing node calls multiple threads to concurrently read the data of each slice, and the number of the multiple threads is less than or equal to the number of slices.
  12. 根据权利要求1至10中任一权利要求所述方法,其特征在于,所述计算节点根据所述每个切片在所述待读取文件中的起始位置,并发读取所述每个切片的数据,包括:The method according to any one of claims 1 to 10, wherein the computing node concurrently reading the data of each slice according to the starting position of each slice in the file to be read comprises:
    所述计算节点调用多个线程并发读取所述每个切片的数据,所述多个线程的数量与所述切片的数量相同。The computing node calls multiple threads to concurrently read the data of each slice, and the number of the multiple threads is the same as the number of the slices.
  13. 一种数据处理方法,其特征在于,应用于数据处理系统,所述数据处理系统包括计算节点和存储节点,所述方法包括:A data processing method, characterized in that it is applied to a data processing system, the data processing system includes a computing node and a storage node, and the method includes:
    所述存储节点获取待读取文件;The storage node obtains the file to be read;
    所述存储节点根据所述待读取文件,获得所述待读取文件的元数据,所述待读取文件的元数据包括所述待读取文件的切片数量、行数、以及每个切片在所述待读取文件中的起始位置,其中,所述行数用于供所述计算节点申请用于存放所述待读取文件的内存空间,所述切片数量用于供所述计算节点创建多个线程,所述每个切片在所述待读取文件中的起始位置用于供所述计算节点调用所述多个线程,并发读取所述每个切片的数据,并按照所述每个切片在所述待读取文件中的起始位置的顺序,将所述每个切片的数据存储至所述内存空间;The storage node obtains, according to the file to be read, metadata of the file to be read, where the metadata of the file to be read includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read, where the number of rows is used by the computing node to apply for memory space for storing the file to be read, the number of slices is used by the computing node to create multiple threads, and the starting position of each slice in the file to be read is used by the computing node to call the multiple threads, concurrently read the data of each slice, and store the data of each slice into the memory space in the order of the starting positions of the slices in the file to be read;
    所述存储节点存储所述待读取文件的元数据。The storage node stores metadata of the file to be read.
  14. 根据权利要求13所述的方法,其特征在于,所述存储节点对所述待读取文件进行解析,获得所述待读取文件的元数据包括:The method according to claim 13, wherein the storage node parses the file to be read, and obtains metadata of the file to be read comprises:
    所述存储节点对所述待读取文件进行解析,确定所述待读取文件的数据类型;The storage node parses the file to be read, and determines the data type of the file to be read;
    所述存储节点根据所述待读取文件的数据类型,确定所述待读取文件的元数据格式,其中,不同的数据类型的待读取文件的元数据格式不同;The storage node determines the metadata format of the file to be read according to the data type of the file to be read, wherein the metadata format of the file to be read is different for different data types;
    所述存储节点根据所述待读取文件的元数据格式和所述待读取文件,生成所述待读取文件的元数据。The storage node generates metadata of the file to be read according to the metadata format of the file to be read and the file to be read.
  15. 根据权利要求13或14所述的方法,其特征在于,所述存储节点存储所述待读取文件的元数据包括:The method according to claim 13 or 14, wherein the storage node storing the metadata of the file to be read comprises:
    所述存储节点将所述待读取文件的元数据存储于所述待读取文件中,所述待读取文件的末尾包括所述元数据在所述待读取文件中的起始位置,使得所述计算节点从所述待读取文件的末尾获得所述元数据在所述待读取文件中的起始位置后,根据所述元数据在所述待读取文件中的起始位置,读取所述待读取文件的元数据。The storage node stores the metadata of the file to be read in the file to be read, and the end of the file to be read includes the starting position of the metadata in the file to be read, so that after obtaining, from the end of the file to be read, the starting position of the metadata in the file to be read, the computing node reads the metadata of the file to be read according to that starting position.
  16. 根据权利要求13或14所述的方法,其特征在于,所述存储节点存储所述待读取文件的元数据包括:The method according to claim 13 or 14, wherein the storage node storing the metadata of the file to be read comprises:
    所述存储节点将所述待读取文件的元数据存储于所述存储节点的指定路径。The storage node stores the metadata of the file to be read in a designated path of the storage node.
  17. 根据权利要求13或14所述的方法,其特征在于,所述存储节点存储所述待读取文件的元数据包括:The method according to claim 13 or 14, wherein the storage node storing the metadata of the file to be read comprises:
    所述存储节点将所述待读取文件的元数据存储于所述待读取文件的存储位置。The storage node stores the metadata of the file to be read in the storage location of the file to be read.
  18. 一种数据处理系统,包括计算节点和存储节点,其特征在于,所述计算节点执行如权利要求1至12任一权利要求所述的方法,所述存储节点执行如权利要求13至17任一权利要求所述的方法。A data processing system, comprising a computing node and a storage node, wherein the computing node executes the method according to any one of claims 1 to 12, and the storage node executes the method according to any one of claims 13 to 17.
PCT/CN2021/088588 2020-06-23 2021-04-21 Data processing method and system WO2021258831A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010581055.1A CN113835870A (en) 2020-06-23 2020-06-23 Data processing method and system
CN202010581055.1 2020-06-23

Publications (1)

Publication Number Publication Date
WO2021258831A1 true WO2021258831A1 (en) 2021-12-30

Family

ID=78964028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088588 WO2021258831A1 (en) 2020-06-23 2021-04-21 Data processing method and system

Country Status (2)

Country Link
CN (1) CN113835870A (en)
WO (1) WO2021258831A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117762873A (en) * 2023-12-20 2024-03-26 中邮消费金融有限公司 Data processing method, device, equipment and storage medium
WO2024103752A1 (en) * 2022-11-16 2024-05-23 工赋(青岛)科技有限公司 File transmission method, apparatus and system, electronic device, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707588A (en) * 2022-09-09 2024-03-15 荣耀终端有限公司 Differential file restoring method and electronic equipment
CN115964353B (en) * 2023-03-10 2023-08-22 阿里巴巴(中国)有限公司 Distributed file system and access metering method thereof
CN117156172B (en) * 2023-10-30 2024-01-16 江西云眼视界科技股份有限公司 Video slice reporting method, system, storage medium and computer

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202152A (en) * 2016-06-23 2016-12-07 浪潮(北京)电子信息产业有限公司 The data processing method of a kind of cloud platform and system
US20180137172A1 (en) * 2016-11-17 2018-05-17 Sap Se Document Store with Non-Uniform Memory Access Aware High Performance Query Processing
CN109710572A (en) * 2018-12-29 2019-05-03 北京赛思信安技术股份有限公司 A kind of file sharding method based on HBase


Also Published As

Publication number Publication date
CN113835870A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
WO2021258831A1 (en) Data processing method and system
CN107105009B (en) Job scheduling method and device for butting workflow engine based on Kubernetes system
US10205627B2 (en) Method and system for clustering event messages
CN110334075B (en) Data migration method based on message middleware and related equipment
US10120928B2 (en) Method and system for clustering event messages and managing event-message clusters
WO2021051627A1 (en) Database-based batch importing method, apparatus and device, and storage medium
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
CN111930489B (en) Task scheduling method, device, equipment and storage medium
US11409711B2 (en) Barriers for dependent operations among sharded data stores
CN115114370B (en) Master-slave database synchronization method and device, electronic equipment and storage medium
US11194522B2 (en) Networked shuffle storage
US9384086B1 (en) I/O operation-level error checking
CN112988884B (en) Big data platform data storage method and device
US11625192B2 (en) Peer storage compute sharing using memory buffer
US11951999B2 (en) Control unit for vehicle and error management method thereof
US11656972B1 (en) Paginating results obtained from separate programmatic interfaces
CN114020525A (en) Fault isolation method, device, equipment and storage medium
CN114547199A (en) Database increment synchronous response method and device and computer readable storage medium
CN115793957A (en) Method and device for writing data and computer storage medium
CN116594551A (en) Data storage method and device
CN113407562A (en) Communication method and device of distributed database system
CN102253940B (en) Method and device for processing data by tree view
CN113536075B (en) Data extraction method, device and storage medium
CN117435367B (en) User behavior processing method, device, equipment, storage medium and program product
CN111258748B (en) Distributed file system and control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21829501; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21829501; Country of ref document: EP; Kind code of ref document: A1)