WO2021258831A1 - Data processing method and system - Google Patents


Info

Publication number
WO2021258831A1
WO2021258831A1 · PCT/CN2021/088588
Authority
WO
WIPO (PCT)
Prior art keywords
read
file
metadata
data
slice
Prior art date
Application number
PCT/CN2021/088588
Other languages
French (fr)
Chinese (zh)
Inventor
朱琦 (Zhu Qi)
崔宝龙 (Cui Baolong)
王俊捷 (Wang Junjie)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021258831A1 publication Critical patent/WO2021258831A1/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • This application relates to the computer field, and in particular to a data processing method and system.
  • When a computing node performs big data or AI tasks, it needs to load data files from other devices or platforms into its memory, and then complete the relevant computation for those tasks based on the data in memory.
  • The efficiency with which the computing node reads the file is very low; the time needed to load the data file into memory can even exceed the time needed to complete the big data or AI task on that data, seriously affecting the efficiency of big data or AI tasks.
  • This application provides a data processing method and system, which can improve the efficiency of reading files by computing nodes.
  • a data processing method is provided, which is applied to a data processing system.
  • the data processing system includes a computing node and a storage node.
  • The data processing method includes the following steps: the computing node obtains metadata of a file to be read, where the metadata includes the number of rows in the file and the starting position of each slice within it. The computing node then reads each slice concurrently according to the starting positions recorded in the metadata. Finally, in the order of the slices' starting positions, the data of each slice is stored into a memory space that was requested according to the number of rows in the metadata.
  • Since the storage node generates the metadata of the file to be read in advance, when the computing node reads the file it can obtain, from the metadata, the number of rows of the file and the starting position of each slice within it. This makes it possible to apply for memory space once and have multiple threads read the file concurrently, avoiding the waste of resources caused by repeatedly expanding the memory space when the number of rows cannot be determined in advance.
  • Concurrent reading of files greatly improves the speed at which computing nodes read files, and further improves the processing efficiency of big data and AI tasks.
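As a sketch of the scheme above — assuming a hypothetical metadata dict with illustrative `num_rows`, `slice_offsets`, and `file_size` fields, not the patent's actual format — concurrent slice reading with one thread per slice might look like:

```python
from concurrent.futures import ThreadPoolExecutor

def read_file_concurrently(path, metadata):
    """Read a file slice-by-slice in parallel, guided by precomputed metadata.

    `metadata` is a hypothetical dict of the form:
      {"num_rows": int, "slice_offsets": [o0, o1, ...], "file_size": int}
    """
    offsets = metadata["slice_offsets"]
    # The end of each slice is the start of the next (or the end of file).
    bounds = list(zip(offsets, offsets[1:] + [metadata["file_size"]]))

    def read_slice(start, end):
        # Each thread opens its own handle so seeks do not interfere.
        with open(path, "rb") as f:
            f.seek(start)
            return f.read(end - start)

    # One thread per slice; map() returns results in slice order, so the
    # buffer is assembled according to each slice's starting position.
    with ThreadPoolExecutor(max_workers=len(bounds)) as pool:
        parts = list(pool.map(lambda b: read_slice(*b), bounds))
    return b"".join(parts)
```

Because the slice offsets come from the metadata, the destination buffer can be sized once up front rather than grown as data arrives.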
  • The metadata of the file to be read may also include the number of slices. Before concurrently reading the data of each slice according to its starting position in the file, the computing node can create multiple threads according to the number of slices and then call those threads to read the slices concurrently.
  • When a storage node generates metadata, it can determine the number of slices x based on the hardware processing capacity of the computing node. When the computing node reads the metadata, it creates y threads based on the number of slices x and its current processing capacity, and calls the y threads to read the x slices concurrently.
  • the number y of multiple threads may be equal to the number of slices x.
  • In this case, each thread processes one slice, and the y threads can read the file in parallel, achieving an optimal processing state that greatly improves the speed at which the computing node reads the file and further improves the processing efficiency of big data and AI tasks.
  • the number y of multiple threads can be less than the number of slices x.
  • When the number of threads created is less than the number of slices, each thread can first process one slice and then, after finishing, continue with the next slice from the remaining ones until all slices have been read. Alternatively, some threads may process only one slice while others process multiple slices: a thread that needs to process p slices can read directly from the starting position of its current slice to the starting position of the (p+1)-th slice. In this way one thread processes multiple slices, and the file can still be read concurrently even when the number of threads is less than the number of slices.
  • The computing node can thus flexibly decide the number of threads to create according to its current processing capacity. If the number of threads the processor can currently create equals the number of slices, multiple threads can be called to read the slices in parallel, with each thread processing exactly one slice; this is the optimal processing state and greatly improves the efficiency of reading the file. If the number of threads the processor can currently create is lower than the number of slices, the slices can still be read concurrently, with some threads processing multiple slices; this avoids concurrent read failures caused by a heavily loaded computing node with reduced processing capacity. A reduction in the number of threads therefore does not prevent concurrent reading, ensuring the feasibility of the solution.
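The contiguous-run assignment described above — where a thread responsible for p slices reads from its first slice's start to the (p+1)-th slice's start — can be sketched as a simple partitioning helper. The function name and signature are illustrative:

```python
def assign_slices(num_slices, num_threads):
    """Partition x slices among y <= x threads as contiguous runs.

    Returns one (first_slice, last_slice_exclusive) pair per thread.
    A thread assigned p consecutive slices can then read in one pass
    from the start of its first slice to the start of the (p+1)-th.
    """
    base, extra = divmod(num_slices, num_threads)
    runs, start = [], 0
    for i in range(num_threads):
        # Spread any remainder over the first `extra` threads.
        count = base + (1 if i < extra else 0)
        runs.append((start, start + count))
        start += count
    return runs
```

When y == x every run has length one, matching the optimal one-slice-per-thread case; when y < x some runs cover several slices.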
  • The metadata of the file to be read is generated according to the metadata format and the file itself, after the storage node determines the metadata format from the file's data type; different data types correspond to different metadata formats.
  • The storage node parses the file to be read in advance, determines the metadata format according to the file's data type, generates the metadata, and then stores it. When the computing node later reads the file, it can effectively initialize the memory data structure according to the metadata and read the file concurrently, thereby improving reading efficiency.
  • The metadata is highly scalable: it can be further extended and enriched with whatever information is needed to read various types of data, giving the solution provided by this application very broad applicability.
  • the metadata of the file to be read is stored in the file to be read, and the end of the file to be read includes the starting position of the metadata in the file to be read.
  • When the computing node obtains the metadata of the file to be read from the storage node, it can read the starting position of the metadata from the end of the file and then read the metadata from that starting position.
  • The metadata of the file to be read may be stored at the end of the file, with the offset of the metadata header and a check mask written at the very end, the check mask located before the metadata header offset. When the computing node reads the metadata, it sets the read pointer at the end of the file, reads a certain range of content in reverse, and checks whether that range contains the check mask. If it does, the node positions the pointer at the check mask, reads the metadata header offset in the forward direction, sets the read pointer at that offset, and reads the metadata forward from there.
  • In this way the computing node can obtain the starting position of the metadata from the end of the file and read it without the storage node having to allocate additional resources for storing the metadata, which simplifies file management on the storage node and reduces its management burden.
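A minimal sketch of the reader side of this footer scheme, under an assumed layout that is illustrative rather than the patent's exact format — JSON metadata followed by a hypothetical 4-byte mask and an 8-byte offset:

```python
import json
import struct

MAGIC = b"META"  # hypothetical 4-byte check mask

def read_footer_metadata(path):
    """Recover metadata appended to the end of a data file.

    Assumed layout (illustrative):
      [payload][JSON metadata][check mask (4 bytes)][metadata offset (8 bytes)]
    The reader starts from the end of the file: it checks for the mask,
    reads the metadata offset forward, then jumps to the metadata itself.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)                      # move to end of file
        size = f.tell()
        f.seek(size - 12)                 # 12 = mask (4) + offset (8)
        if f.read(4) != MAGIC:            # no mask: no embedded metadata
            return None
        (meta_start,) = struct.unpack("<Q", f.read(8))
        f.seek(meta_start)
        return json.loads(f.read(size - 12 - meta_start))
```

Placing the mask before the offset means a reader scanning backward from the end of the file encounters the mask first, matching the order described above.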
  • the metadata of the file to be read is stored in a designated path of the storage node.
  • the metadata storage location of the file to be read is the same as the storage location of the file to be read.
  • The file to be read and its metadata include a common identifier. The computing node obtains the metadata of the file from the storage node as follows: it obtains the common identifier of the file from the storage node, and then, according to that identifier, obtains the metadata from the designated path or from the storage location of the file.
  • After the storage node sets a common identifier for the file to be read and its metadata, it stores the metadata under the specified path or at the file's storage location. When the computing node reads the metadata, it can use the common identifier to obtain it from the specified path or the file's storage location without modifying the file-reading logic, so the scheme can be applied to more computing nodes.
  • the metadata of the file to be read includes verification information.
  • the verification information is used to verify whether the metadata of the file to be read has changed after being stored in the storage node.
  • Before calling multiple threads to concurrently read the data of each slice according to its starting position in the file, the computing node can use the verification information to verify the metadata. After confirming that no data has been lost or damaged since the metadata was stored on the storage node, it reads the file concurrently according to the metadata.
  • Before the computing node calls multiple threads to concurrently read the data of each slice, the method may further include the following step: the computing node checks, according to the verification information, whether the metadata of the file to be read has changed since it was stored on the storage node. If it has not changed, the computing node calls multiple threads to concurrently read the data of each slice according to each slice's starting position in the file.
  • The verification information may include a check mask, a metadata check value, a file check value, a metadata format version, a file format version, and so on. The check mask is used by the computing node to identify the header of the metadata, so it is usually located at the metadata header.
  • The metadata check value is used by the computing node to determine whether the metadata has changed since it was stored on the storage node; a change indicates that the metadata may be damaged or lost, in which case the computing node can fall back to other common data processing methods in the industry to read the file.
  • The file check value is used by the computing node to determine whether the file has changed since it was stored on the storage node; a change indicates that the file may be damaged or lost, in which case the computing node can return a data processing failure message.
  • the metadata format version is used by the computing node to determine whether it supports reading the data in this format version. If not, the computing node can use other data processing methods commonly used in the industry to read the file to be read.
  • the file format version is used for the computing node to determine whether it supports reading the file of this format version. If it does not support it, the computing node can use other common data processing methods in the industry to read the file to be read.
  • the above verification information may also include more or less content, which is not specifically limited in this application.
  • the method for verifying the above verification information can use verification methods commonly used in the industry, such as hash verification, sha256 verification, etc., which are not specifically limited in this application.
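Taking the sha256 option mentioned above, the metadata check value can be sketched as follows; the function names are illustrative:

```python
import hashlib

def metadata_check_value(meta_bytes: bytes) -> str:
    """Compute a SHA-256 digest to serve as the metadata check value."""
    return hashlib.sha256(meta_bytes).hexdigest()

def metadata_unchanged(meta_bytes: bytes, stored_check_value: str) -> bool:
    """True if the metadata read back matches the check value stored
    alongside it, i.e. the metadata has not changed since storage."""
    return metadata_check_value(meta_bytes) == stored_check_value
```

The same pattern applies to the file check value: the storage node records the digest when storing, and the computing node recomputes and compares before trusting the data.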
  • Before the computing node calls multiple threads to concurrently read the file based on the metadata, it can first read the verification information in the metadata header to determine whether the metadata has changed since it was stored on the storage node, and use the metadata to read the file only if no change has occurred. This avoids the computing node reading the file according to incorrect metadata after a metadata change, and improves the feasibility of the solution provided by this application.
  • the metadata of the file to be read also includes the data type.
  • The metadata also includes the feature value type, which the computing node uses to initialize the data structure of the memory space.
  • Before the computing node calls multiple threads according to the starting position of each slice in the file and concurrently reads the data of each slice, the method may also include the following step: the computing node initializes the data structure of the memory space according to the data type.
  • The computing node can initialize the memory data structure according to the feature value type in the metadata, ensuring that reading the file will not fail because of an incorrect memory data structure and improving reading efficiency.
  • The metadata of the file to be read also includes the number of values, which is used to apply for the first memory space for storing data values and data column indexes.
  • Before the computing node calls multiple threads according to the starting position of each slice and reads the slices concurrently, the method also includes the following steps: the computing node applies for a first memory space for storing data values and data column indexes according to the number of values, applies for a second memory space for storing row data according to the number of rows, and obtains the overall memory space from the first and second memory spaces.
  • The computing node can apply for memory space according to the number of values and the number of rows in the metadata, ensuring that a file whose data type is a sparse matrix can obtain its memory space in one application without repeated expansions, which avoids wasting resources and improves reading efficiency.
  • The starting position of each slice in the file to be read includes the starting position of the slice's data column index, the starting position of its data values, and the starting position of its row data amounts. The computing node stores the data of each slice into the memory space in order: according to the order of the starting positions of the data column indexes and data values, it stores each slice's column indexes and values into the first memory space; according to the order of the starting positions of the row data amounts, it stores each slice's row data amounts into the second memory space.
  • According to the starting positions of each slice's data column index, data values, and row data amounts, the computing node reads the three groups of data of the sparse matrix, ensuring that a file whose data type is a sparse matrix can also be read concurrently and improving reading efficiency.
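The one-shot allocation for a sparse-matrix file can be sketched as below; the metadata field names `num_values` and `num_rows` are illustrative, and the three arrays mirror a CSR-style layout (values, column indexes, per-row data amounts):

```python
from array import array

def allocate_sparse_buffers(metadata):
    """One-shot allocation for a sparse-matrix file (a sketch).

    The first memory space holds the data values and their column
    indexes; the second holds the per-row data amounts, mirroring the
    two memory spaces described above.
    """
    n_vals = metadata["num_values"]
    n_rows = metadata["num_rows"]
    values = array("d", [0.0] * n_vals)      # data values
    col_index = array("q", [0] * n_vals)     # data column indexes
    row_counts = array("q", [0] * n_rows)    # row data amounts
    return values, col_index, row_counts
```

Because both counts come from the metadata, all three buffers are sized exactly once; each thread can then write its slice's portion of each array at the positions given by the slice's three starting offsets.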
  • the data processing system includes a computing node and a storage node.
  • The above data processing method includes the following steps: the storage node obtains the file to be read and parses it to obtain the metadata of the file, where the metadata includes the number of slices, the number of rows, and the starting position of each slice in the file.
  • the number of rows is used by the computing node to apply for memory space for storing the file to be read
  • the number of slices is used for the computing node to create multiple threads
  • The starting position of each slice in the file to be read is used by the computing node to call multiple threads, read the data of each slice concurrently, and store the data of each slice in the memory space in the order of the slices' starting positions.
  • The storage node stores the metadata of the file to be read.
  • Since the storage node generates the metadata of the file to be read in advance, when the computing node reads the file it can determine, from the metadata, information such as the number of rows, the number of slices, and each slice's starting position in the file. This makes it possible to apply for memory space once and read the file concurrently with multiple threads, which not only avoids data processing failures caused by incorrectly initializing the memory data structure when the data type cannot be determined, but also avoids the waste of resources caused by repeatedly expanding the memory space when the number of rows cannot be determined. With the file read concurrently, the speed at which the computing node reads files is greatly improved, further improving the processing efficiency of big data and AI tasks.
  • The specific process for the storage node to obtain the metadata may be as follows: the storage node parses the file to be read and determines its data type; it then determines the metadata format from the data type, where different data types correspond to different metadata formats; finally, it generates the metadata according to the metadata format and the file itself.
  • The storage node parses the file to be read in advance, determines the metadata format according to the file's data type, generates the metadata, and then stores it. When the computing node later reads the file, it can effectively initialize the memory data structure according to the metadata and read the file concurrently, thereby improving reading efficiency.
  • The metadata is highly scalable: it can be further extended and enriched with whatever information is needed to read various types of data, giving the solution provided by this application very broad applicability.
  • The specific steps for the storage node to store the metadata may be as follows: the storage node stores the metadata within the file to be read, and the end of the file records the starting position of the metadata, so that the computing node can obtain that starting position from the end of the file and then read the metadata from it.
  • The metadata of the file to be read can be stored at the end of the file, with the offset of the metadata header and a check mask written at the very end, the check mask located before the metadata header offset. When the computing node reads the metadata, it sets the read pointer at the end of the file, reads a certain range of content in reverse, and checks whether that range contains the check mask. If it does, the node positions the pointer at the check mask, reads the metadata header offset in the forward direction, sets the read pointer at that offset, and reads the metadata forward from there.
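The storage-node side of this scheme can be sketched as below — again under an assumed, illustrative layout of JSON metadata followed by a hypothetical 4-byte mask and an 8-byte offset, written so that a reader scanning backward from the end of the file finds the mask before the offset:

```python
import json
import struct

MAGIC = b"META"  # hypothetical 4-byte check mask

def append_footer_metadata(path, metadata):
    """Append JSON metadata to an existing file, followed by the check
    mask and the 8-byte offset of the metadata header.

    Returns the offset at which the metadata starts, for reference.
    """
    with open(path, "ab") as f:
        meta_start = f.seek(0, 2)         # current end of file
        f.write(json.dumps(metadata).encode())
        f.write(MAGIC)
        f.write(struct.pack("<Q", meta_start))
    return meta_start
```

Appending keeps the original payload untouched, so the file can still be read by consumers that ignore the footer entirely.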
  • In this way the computing node can obtain the starting position of the metadata from the end of the file and read it without the storage node having to allocate additional resources for storing the metadata, which simplifies file management on the storage node and reduces its management burden.
  • the specific steps of the storage node storing the metadata of the file to be read may be as follows: the storage node stores the metadata of the file to be read in a designated path of the storage node.
  • the specific steps of the storage node storing the metadata of the file to be read may be as follows: the storage node stores the metadata of the file to be read in the storage location of the file to be read.
  • The metadata of the file to be read and the file itself include a common identifier, which the computing node uses to obtain the metadata from a specified path or from the file's storage location.
  • After the storage node sets a common identifier for the file to be read and its metadata, it stores the metadata under the specified path or at the file's storage location. When the computing node reads the metadata, it can use the common identifier to obtain it from the specified path or the file's storage location without modifying the file-reading logic, so the scheme can be applied to more computing nodes.
  • The metadata storage method can be flexibly chosen according to the application environment, making the data processing methods provided in this application more widely applicable.
  • the metadata of the file to be read includes verification information, and the verification information is used for the computing node to verify whether the metadata of the file to be read has changed after being stored in the storage node.
  • The verification information may include a check mask, a metadata check value, a file check value, a metadata format version, a file format version, and so on. The check mask is used by the computing node to identify the header of the metadata, so it is usually located at the metadata header.
  • The metadata check value is used by the computing node to determine whether the metadata has changed since it was stored on the storage node; a change indicates that the metadata may be damaged or lost, in which case the computing node can fall back to other common data processing methods in the industry to read the file.
  • The file check value is used by the computing node to determine whether the file has changed since it was stored on the storage node; a change indicates that the file may be damaged or lost, in which case the computing node can return a data processing failure message.
  • the metadata format version is used by the computing node to determine whether it supports reading the data in this format version. If not, the computing node can use other data processing methods commonly used in the industry to read the file to be read.
  • the file format version is used for the computing node to determine whether it supports reading the file of this format version. If it does not support it, the computing node can use other common data processing methods in the industry to read the file to be read.
  • the above verification information may also include more or less content, which is not specifically limited in this application.
  • the method for verifying the above verification information can use verification methods commonly used in the industry, such as hash verification, sha256 verification, etc., which are not specifically limited in this application.
  • The storage node writes the verification information into the metadata header of the file to be read, so that before calling multiple threads to concurrently read the file based on the metadata, the computing node can read the verification information in the metadata header to determine whether the metadata has changed since it was stored on the storage node. The metadata is used to read the file only if no change has occurred, which avoids the computing node reading the file according to incorrect metadata after a metadata change and improves the feasibility of the solution provided by this application.
  • the metadata of the file to be read also includes the data type.
  • The metadata also includes the feature value type, which the computing node uses to initialize the data structure of the memory space.
  • The storage node puts the feature value type into the metadata of the dense matrix, so that the computing node can initialize the memory data structure according to that type, ensuring that reading the file will not fail because of memory data structure errors and improving reading efficiency.
  • The metadata of the file to be read also includes the number of values. The file to be read contains data values, data column indexes, and row data amounts. The number of values is used by the computing node to apply for the first memory space, which stores the data values and data column indexes; the number of rows is used to apply for the second memory space, which stores the row data amounts; and the memory space of the file to be read comprises the first and second memory spaces.
  • The storage node puts the number of values into the metadata of the sparse matrix, and the computing node can apply for memory space according to the number of values and the number of rows in the metadata. This ensures that a file whose data type is a sparse matrix can obtain its memory space in one application without repeated expansions, avoiding wasted resources and improving reading efficiency.
  • The starting position of each slice in the file to be read includes the starting position of the slice's data column index, the starting position of its data values, and the starting position of its row data amounts. According to these positions the computing node reads the three groups of data of the sparse matrix, ensuring that a file whose data type is a sparse matrix can also be read concurrently and improving reading efficiency.
  • A computing node is provided, which includes modules for executing the data processing method in the first aspect or any one of its possible implementation manners.
  • A storage node is provided, which includes modules for executing the data processing method in the second aspect or any one of its possible implementation manners.
  • a data processing system including a computing node and a storage node.
  • the computing node is used to implement the operation steps of the method described in the first aspect or any one of the possible implementations of the first aspect.
  • The storage node is used to implement the operation steps of the method described in the second aspect or any one of its possible implementation manners.
  • A computer program product is provided which, when run on a computer, causes the computer to execute the methods described in the above aspects.
  • a computer-readable storage medium is provided, and instructions are stored in the computer-readable storage medium, which when run on a computer, cause the computer to execute the methods described in the foregoing aspects.
  • FIG. 1 is a schematic diagram of the architecture of a multi-core processor provided by the present application.
  • FIG. 2 is a schematic diagram of the architecture of a data processing system provided by the present application.
  • FIG. 3 is a schematic structural diagram of a data processing system provided by the present application.
  • FIG. 4 is a schematic flowchart of the steps of a data processing method provided by the present application.
  • FIG. 5 and FIG. 6 are schematic diagrams of the metadata format provided by this application.
  • FIG. 7 shows a format of a file to be read containing metadata provided by this application.
  • FIG. 8 is a schematic flowchart of steps of a data processing method provided by the present application.
  • FIG. 9 is a schematic flowchart of another data processing method provided by this application.
  • FIG. 10 is a schematic flowchart of another data processing method provided by this application.
  • FIG. 11 is a schematic flowchart of another data processing method provided by this application.
  • FIG. 12 is a schematic structural diagram of a computing node provided by this application.
  • FIG. 13 is a schematic diagram of the structure of a server provided by the present application.
  • FIG. 14 is a schematic structural diagram of a storage array provided by the present application.
  • Big data: a collection of data that cannot be captured, managed, and processed with conventional software tools within a certain time frame.
  • the strategic significance of big data technology lies in the professional processing of massive amounts of data.
  • the processed data can be applied to various industries, including finance, automobiles, catering, telecommunications, energy, etc.; for example, unmanned cars that use big data technology and Internet of Things technology, using big data technology to analyze customer behavior for product recommendation, using big data technology to realize credit risk analysis, and so on.
  • Artificial intelligence (AI): theories, methods, technologies, and application systems that use digital computers or computing nodes controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • the application scenarios of artificial intelligence are very wide, such as face recognition, vehicle recognition, pedestrian re-recognition, data processing applications, and so on.
  • the underlying model of AI is a collection of mathematical methods for achieving AI. A large number of samples can be used to train an AI model so that the trained AI model obtains the ability to predict. The samples used to train the AI model can be obtained from a big data platform.
  • Concurrency: two or more events occurring within the same period of time.
  • concurrency refers to multiple threads operating the same resource to process the same or different tasks in a period of time. It should be noted that concurrency includes multiple threads operating at the same time (parallel) within a period of time, and also includes multiple threads operating alternately in time-sharing within a period of time.
  • the core of the processor, also called the kernel, is an important part of the processor.
  • the core can be understood as the execution unit of the processor: all tasks of the processor, such as calculation, receiving/storing commands, and data processing, are executed by the core.
  • Thread: the smallest unit that the operating system can schedule for execution.
  • a core corresponds to at least one thread. Through hyper-threading technology, a core can also correspond to two or more threads, that is, multiple threads are running at the same time.
  • Multi-core processor: one or more cores can be deployed in a processor. If the number M of cores deployed in the processor is not less than 2, the processor is called a multi-core processor.
  • the multi-core processor also includes a memory 109 for storing data, such as double data rate synchronous dynamic random access memory (DDR SDRAM).
  • DDR SDRAM double data rate synchronous dynamic random access memory
  • each core and the memory are connected by a bus 110, and each core can access the data in the memory through the shared memory.
  • concurrent processing is the advantage of the multi-core processor, and the multi-core processor can call multiple threads in a specific clock cycle to concurrently process more tasks.
  • Multi-CPU multi-core processor: also known as a multi-chip multi-core processor, this processor contains multiple multi-core processor chips as shown in FIG. 1. The multiple multi-core processor chips are connected through an interconnect structure, and the interconnect structure can be implemented in a variety of ways, such as a bus.
  • Figure 2 is a schematic diagram of the architecture of a big data or AI task processing system.
  • Figure 2 can also be referred to as a schematic diagram of the architecture of a data processing system.
  • the data processing system is used for the computing node to implement the file reading process and for the storage node to implement the file storage process.
  • the system includes a computing node 210, a storage node 220, and a data collection node 230.
  • the processors on the computing node 210 and the storage node 220 are usually the multi-core processor 100 or the multi-CPU multi-core processor shown in FIG. 1.
  • the storage node 220, the data collection node 230, and the computing node 210 are connected through a network, and the network may be a wired network, a wireless network, or a mixture of the two.
  • the computing node 210 and the storage node 220 may be physical servers, such as X86 servers or ARM servers; they may also be virtual machines (VMs) implemented on general physical servers with network functions virtualization (NFV) technology. A virtual machine is a complete software-simulated computer system with complete hardware system functions, running in a fully isolated environment, such as a virtual machine in a cloud data center; this application is not particularly limited.
  • the storage node 220 may also be other storage devices with storage functions, such as a storage array. It should be understood that the computing node and the storage node 220 may be a single physical server or a single virtual machine, and may also constitute a computer cluster, which is not specifically limited in this application.
  • the data collection node 230 can be a hardware device, for example, a physical server or a cluster of physical servers, or software, for example, a data collection system deployed in a server or a virtual machine. The data collection system can collect data stored in other servers, for example, log information in a website server, and can also collect data gathered by other hardware devices. It should be understood that the above examples are only for illustration, and this application is not specifically limited.
  • FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between nodes, modules, etc. shown in the figure does not constitute any limitation.
  • the computing node 210, the storage node 220, and the data collection node 230 in FIG. 2 are all described by taking three independent devices or server clusters as an example.
  • the computing node 210, the storage node 220, and the data collection node 230 may also be the same server cluster or server, or the computing node 210 and the storage node 220 may be the same server cluster or server, etc., which is not specifically limited in this application.
  • the data collection node 230 collects various raw data and sends them to the storage node 220.
  • after the storage node 220 performs data processing on the received raw data, the file to be read is generated and stored in the storage node 220. It should be understood that, since the sources of the raw data are very wide and the data structures are very complex, the storage node 220 needs to "translate" the raw data into a unified format that can be directly read and written by the processor before storage. Data processing may include data cleaning, feature extraction, format conversion, etc., which is not specifically limited in this application.
  • the computing node 210 reads various files to be read from the storage node 220 and loads them into the memory 109 of the computing node 210.
  • the multi-core processor 100 of the computing node 210 completes the related operations of big data or AI tasks according to the data in the memory 109.
  • FIG. 2 illustrates, as an example, the second core 102 completing the AI task and the third core 103 completing the big data task.
  • the multi-core processor 100 can process multiple tasks concurrently, and multiple cores can process multiple tasks in a specific clock cycle; the multiple tasks may be the same AI task, the same big data task, or the same data processing task, which is not specifically limited in this application.
  • the data collection node 230 is a cloud server deployed with specific services (for example, Kafka and/or Flume), where Kafka is used to provide a high-throughput and highly scalable distributed message queue service, and Flume is a distributed service for collecting, aggregating, and moving massive amounts of log data.
  • the storage node 220 is a computer cluster deployed with a Hadoop distributed file system (HDFS). The storage node 220 can also be deployed with a data processing system, such as Spark, where Spark is a unified analytics engine for large-scale data processing.
  • the computing node 210 is a computer cluster deployed with Spark-ML, where Spark-ML is used to process machine learning (ML) tasks.
  • the cloud server (data collection node 230) deployed with Kafka and/or Flume can first generate massive amounts of raw data and save the raw data in HDFS (storage node 220). Spark on the storage node 220 can read the raw data and perform data processing on it, such as feature extraction and format conversion, convert the raw data into a data format that can be processed by machine learning or big data tasks, and then generate the file to be read and save it in HDFS.
  • Spark-ML (computing node 210) reads the file to be read from HDFS and loads it into the memory 109.
  • the multi-core processor 100 performs machine learning tasks based on the data in the memory 109, such as k-means clustering (K-means) or linear regression processing.
  • when the computing node 210 performs tasks such as big data and machine learning, it needs to first read the file to be read from the storage node 220 and load the file to be read into the memory 109 of the computing node 210 (step 1 in FIG. 2); the computing node 210 then completes the related operations of the big data or machine learning task according to the data in the memory 109 (step 2 in FIG. 2).
  • This application provides a data processing system 400 as shown in FIG. 3. It should be understood that using the data processing system 400 shown in FIG. 3 to perform data processing in the application scenario shown in FIG. 2 can greatly improve the data processing speed of the computing node 210, and further improve the efficiency of the computing node 210 in processing big data or AI tasks.
  • the data processing system 400 includes a computing node 210 and a storage node 220.
  • the specific form and connection manner of the computing node 210 and the storage node 220 can be implemented with reference to FIG. 1, and details are not repeated here.
  • the storage node 220 includes a metadata generating unit 221, which is used to generate metadata of the file to be read.
  • the metadata records basic information of the file to be read.
  • the basic information includes at least the number of lines of the file to be read, the maximum number of slices, and the starting position of each slice in the file to be read.
  • for example, the maximum number of slices of the file to be read is 3, the number of rows is 9, and the starting position of slice 1 is the first line of the file to be read.
  • the starting position of slice 2 is the 4th line of the file to be read
  • the starting position of slice 3 is the 7th line of the file to be read.
  • the metadata may also include more information, such as the type of feature value, the number of columns, etc., which may be specifically determined according to the data type of the file to be read, which is not specifically limited in this application.
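  • As an illustration only, the basic information described above (number of rows, maximum number of slices, and the starting position of each slice) can be sketched as a small Python structure; the class and field names are hypothetical and do not reflect the application's actual on-disk layout:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SliceInfo:
    offset: int       # byte offset of the slice's first line in the file
    line_number: int  # line number of the slice's first line

@dataclass
class Metadata:
    row_count: int    # total number of rows in the file to be read
    slice_count: int  # maximum number of slices
    slices: List[SliceInfo]  # starting position of each slice

# The 9-row, 3-slice example above: slices begin at lines 1, 4 and 7
# (the byte offsets 0/120/240 are made up for the illustration).
meta = Metadata(row_count=9, slice_count=3,
                slices=[SliceInfo(0, 1), SliceInfo(120, 4), SliceInfo(240, 7)])
assert meta.slice_count == len(meta.slices)
```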
  • the metadata generating unit 221 only records the maximum number of slices of the file to be read and the starting position of each slice in the file to be read; the file to be read is not actually sliced and is stored completely, in an unsliced state, in the storage node 220.
  • the metadata can be stored in the storage node together with the file to be read in the form of a separate file, or it can be integrated into the file to be read and stored in the storage node.
  • the specific storage process of the metadata will be described in step S520 of the embodiment of FIG. 4 below.
  • the metadata generating unit 221 may generate corresponding metadata based on the raw data when the storage node 220 receives the raw data, or generate corresponding metadata for the processed data after the storage node 220 performs data processing on the raw data (such as the aforementioned data cleaning, feature extraction, and format conversion) but before the file to be read is generated. After the storage node 220 has generated the file to be read, it can also generate the corresponding metadata according to the file to be read. This application does not limit the input data of the metadata generating unit 221.
  • the computing node 210 includes a metadata reading unit 211 and a slice reading unit 212.
  • the metadata reading unit 211 is used to read the metadata of the file to be read
  • the slice reading unit 212 is used to determine, according to the metadata, the number y of threads for concurrent reading, apply for memory space for the file to be read, and send a data read request to each of the y threads.
  • for example, if the number of slices is 3, the number of threads y can be 1, 2, or 3.
  • Each data read request carries the starting position of a slice in the file to be read and the address of the previously applied memory space.
  • the data read request received by thread 1 carries the starting position of slice 1 in the file to be read
  • the data read request received by thread 2 carries the starting position of slice 2 in the file to be read
  • the data reading request received by thread 3 carries the starting position of slice 3 in the file to be read.
  • the y threads concurrently read the slices of the file to be read according to the starting positions of the received slices, and write the read slices into the above memory space in the order of the starting position of each slice in the file to be read.
  • FIG. 3 takes one core corresponding to one thread as an example (in FIG. 3, core 1 corresponds to thread 1, core 2 corresponds to thread 2, and core 3 corresponds to thread 3). In other embodiments, core 1 may correspond to thread 1 and thread 2 while core 2 corresponds to thread 3, or core 1 may correspond to threads 1 to 3, and so on, so as to achieve the purpose of multiple cores reading files concurrently, improve resource utilization, and improve data processing efficiency.
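  • The concurrent reading described above, where y threads write disjoint slices into one memory space applied for in advance, can be sketched as follows; this is a minimal illustration using Python threads over an in-memory stand-in for the file, and all names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def read_slices_concurrently(lines, slice_starts, row_count):
    """Read every slice into one pre-allocated buffer, one thread per slice.

    `lines` stands in for the file to be read; `slice_starts` holds the
    1-based starting line of each slice, as recorded in the metadata.
    """
    buffer = [None] * row_count   # memory space applied for once, in advance
    bounds = list(slice_starts) + [row_count + 1]

    def read_one(i):
        start, end = bounds[i], bounds[i + 1]  # slice i covers lines [start, end)
        buffer[start - 1:end - 1] = lines[start - 1:end - 1]

    with ThreadPoolExecutor(max_workers=len(slice_starts)) as pool:
        list(pool.map(read_one, range(len(slice_starts))))
    return buffer

data = [f"line{i}" for i in range(1, 10)]   # a 9-line file, as in the example above
result = read_slices_concurrently(data, [1, 4, 7], 9)
assert result == data
```

Because every thread writes a disjoint region of the single buffer, the file lands in memory in slice order without any per-thread memory applications or extra copies, which mirrors the resource-saving point made above.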
  • the data collection node 230 is a cloud server deployed with Kafka and/or Flume
  • the storage node 220 is a computer cluster deployed with HDFS and Spark
  • the computing node 210 is a computer cluster deployed with Spark-ML.
  • the above-mentioned metadata generating unit 221 may be deployed in Spark
  • the metadata reading unit 211 and the slice reading unit 212 may be deployed in Spark-ML.
  • the cloud server (data collection node 230) deployed with Kafka and/or Flume can first generate a large amount of raw data and save the raw data in HDFS (storage node 220). Spark on the storage node 220 can first read the raw data and perform data processing on it, such as feature extraction and format conversion, then generate the file to be read and the corresponding metadata based on the processed data, and then store the file to be read and the corresponding metadata in HDFS. Finally, when Spark-ML (computing node 210) reads the file to be read from HDFS, it first reads the metadata of the file to be read, and then applies for a contiguous segment of memory space based on the information in the metadata.
  • multiple threads are then called to concurrently read the file to be read and load it into the previously requested memory space, after which the machine learning task is performed based on the data in the memory 109.
  • when the computing node 210 reads the file to be read, it can not only read concurrently, but also avoid the resource waste caused by applying for memory multiple times and copying data multiple times, which greatly improves the efficiency of data processing.
  • before the metadata reading unit 211 reads the metadata, it determines whether the file to be read has corresponding metadata. If the file to be read does not have metadata, it can notify the slice reading unit 212 to read the file to be read in a single thread according to the current data processing method in the industry, which is not limited in this application.
  • the storage node 220 in the system generates the metadata of the file to be read before the computing node 210 reads the file to be read, so that the computing node 210 can call multiple threads to read the file concurrently. This not only avoids the problems of incorrect initialization of the memory-space data structure and failed data processing caused by the inability to determine the data type, but also avoids the resource waste caused by expanding the memory space multiple times because the number of lines of the file to be read cannot be determined. Moreover, the ability to read files concurrently greatly improves the speed at which the computing node 210 reads files, and further improves the processing efficiency of big data and AI tasks.
  • before the computing node 210 reads the file to be read, the storage node 220 needs to generate corresponding metadata according to the file to be read, and then store the file to be read and the corresponding metadata in the storage node 220. Therefore, the data processing method provided in this application is first described in detail below with reference to FIG. 4.
  • the specific process of generating metadata by the storage node 220 may include the following steps:
  • S510: Obtain the file to be read from the data collection node 230, and parse the file to be read to obtain the metadata of the file to be read.
  • this application provides a variety of metadata formats to adapt to various application scenarios.
  • the storage node can first determine the data type of the file to be read, and then determine the metadata format of the file to be read according to that data type; files to be read of different data types have different metadata formats. Finally, the metadata of the file to be read is generated according to the metadata format and the parsing result of the file to be read.
  • the metadata records the basic information of the file to be read.
  • the basic information includes at least the number of lines of the file to be read, the maximum number of slices, and the starting position of each slice in the file to be read. Therefore, the format of the metadata may be as shown in FIG. 5, where the format of the metadata includes at least basic information 610, and the basic information 610 includes:
  • (1) The number of rows: used to identify the total number of rows contained in the file to be read, so that the computing node 210 can apply for memory space for storing the file to be read.
  • (2) The number of slices: used to identify the number of slices contained in each file to be read, so that the computing node 210 can call multiple threads to concurrently read the file to be read.
  • the number of slices is usually the maximum number of slices of the file to be read, and the maximum number of slices is an empirical value. It is understandable that if the number of slices of the file to be read is too large, the metadata length of the file to be read will be too large, which will reduce the speed at which the computing node 210 reads the metadata; if the number of slices of the file to be read is too small, part of the cores will remain idle when the computing node 210 concurrently reads the file to be read, which causes a waste of resources. Therefore, the maximum number of slices of the file to be read can be determined according to the number of cores of the computing node 210; for example, the maximum number of slices is equal to the number of processor cores of the computing node 210, or the maximum number of slices is proportional to the number of processor cores. This application makes no specific limitation.
  • (3) The starting position of each slice: used by the threads to read the file to be read concurrently. Each thread can read one slice of the file to be read according to the starting position of that slice in the file to be read and put it into the previously requested memory space, thereby completing the concurrent reading of the file to be read and improving reading efficiency.
  • optionally, the starting position of each slice can be the offset value and line number of the starting position of that slice in the file to be read; each thread can determine the length l of its slice based on the slice's line number and the line number of the next slice's starting position, then set the read pointer to the offset value and read the slice of length l.
  • the starting position of each slice can also include more or less content; for example, it may include only the offset value of the starting position of each slice in the file to be read, or it may additionally include the length of each slice, which is not limited in this application.
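  • As a sketch of the offset-based variant just described, a thread that receives a slice's byte offset can derive the slice length l from the next slice's offset, set the read pointer, and read; the helper below is illustrative only:

```python
import os
import tempfile

def read_slice(path, offset, next_offset):
    # The slice length l is the distance between consecutive slice start
    # offsets recorded in the metadata; set the read pointer to the offset
    # and read l bytes.
    length = next_offset - offset
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)

# Throwaway demo file whose three 6-byte lines start at offsets 0, 6 and 12.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"r1,r1\nr2,r2\nr3,r3\n")
    path = tmp.name
slice1 = read_slice(path, 0, 6)
slice2 = read_slice(path, 6, 12)
os.remove(path)
assert slice1 == b"r1,r1\n"
```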
  • the metadata may also include verification information, which is used to improve the reliability of the metadata.
  • the metadata may also include verification information 620, where the verification information 620 includes:
  • (4) The check mask: used by the computing node 210 to confirm that this is the header of the metadata; therefore, the check mask is located at the header of the metadata.
  • the check mask of the metadata header can be checked first, and this application makes no specific restriction. If the computing node 210 succeeds in verifying the check mask, it proves that the current position of the read pointer is the head of the metadata; the computing node 210 can then start to read the metadata and call multiple threads to concurrently read the file to be read according to the metadata. If the verification fails, the slice reading unit 212 is called to read the file to be read according to the current data processing method in the industry, and this application is not limited to this.
  • optionally, the check mask can be represented by a binary value to speed up processing
  • (5) The metadata check value: used to check whether the content of the metadata information has changed.
  • (6) The file check value: used to check whether the data content of the file to be read has changed.
  • (7) The metadata format version: used to record the format version of the current metadata information.
  • In this way, when the computing node reads the metadata, if it does not support reading metadata information in the latest format, it can still remain compatible with files of the old version;
  • (8) The file format version: used to record the format information of the file currently to be read.
  • when the computing node 210 reads the metadata, it can read the verification information 620 first, and after confirming that the metadata and the data content of the file to be read have not changed and that the version formats are compatible, it can then read the basic information 610 and call multiple threads to read the file to be read concurrently. Therefore, the verification information 620 in the metadata format shown in FIG. 5 is located before the basic information 610. Of course, other methods can also be used to ensure that the computing node reads the verification information 620 before the other metadata information, which is not specifically limited in this application.
  • the verification information (4) to (8) in FIG. 5 are used for illustration, and the metadata may also include more or fewer types of verification information to ensure the reliability of the metadata, which is not specifically limited here.
  • items (4) to (6) above can use verification methods commonly used in the industry, such as hash verification or sha256 verification, which are not specifically limited in this application.
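  • For illustration, a file check value of the kind mentioned above could be computed with sha256; the application names sha256 only as one possible method, so the concrete scheme and names here are assumptions:

```python
import hashlib

def file_check_value(payload: bytes) -> str:
    # Compute a check value over the raw bytes; sha256 is one of the
    # industry-standard methods mentioned above.
    return hashlib.sha256(payload).hexdigest()

content = b"1,2,3\n4,5,6\n"
stored = file_check_value(content)    # the value written into the metadata
# Later, the computing node recomputes the value and compares:
assert file_check_value(content) == stored          # content unchanged
assert file_check_value(content + b"x") != stored   # any change is detected
```

The metadata check value works the same way, applied to the metadata bytes instead of the file content.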
  • the computing node needs different information when reading files to be read of different data types.
  • the data type of the file to be read is usually a dense matrix or a sparse matrix.
  • when the data type of the file to be read is a dense matrix, the computing node 210 needs to initialize the memory data structure according to the string type of the feature value of each column of the dense matrix to ensure that the file to be read will not be wrongly parsed or lost; when the data type of the file to be read is a sparse matrix, the computing node 210 does not need to obtain the feature values of each column of the matrix, but instead needs to apply for memory space for storing the "data values" and "data column indexes" according to the number of values of the sparse matrix.
  • therefore, the metadata formats of different data types will also differ. The following uses the dense matrix data type as an example to describe the metadata format.
  • the metadata may also include type information 630.
  • type information 630 includes:
  • since the computing node 210 executes different reading logic when reading files to be read of different data types (for example, a dense matrix requires additional initialization of the data structure of the memory space), the type information 630 in FIG. 5 is located before the basic information 610. In this way, the computing node 210 first verifies the metadata and the file to be read according to the verification information 620, then determines its reading logic according to the type information 630, and finally calls multiple threads to concurrently read the file to be read according to the basic information 610 and the reading logic. Of course, other methods can also be used to ensure the order of reading the various metadata information, which is not specifically limited in this application.
  • if the data type of the file to be read is different, the metadata format is also different, and the content of the type information 630 is also different. For example, when the data type of the file to be read is a sparse matrix, the type information 630 will not include (10), but will additionally include:
  • (11) The number of values: used to store the number of values of the sparse matrix.
  • the computing node 210 can apply for memory space according to the number of values of the sparse matrix. It should be understood that the storage form of the sparse matrix contains a total of 3 rows of characters, and each data item is saved by these 3 rows: one row of characters represents the "data column index" corresponding to each data item, one row represents the "data value" corresponding to each data item, and one row represents the "row data amount" corresponding to each data item. Therefore, for a sparse matrix, (1) the number of rows is used to apply for the first memory space for storing the "row data amount", and (11) the number of values is used to apply for the second memory space for storing the "data values" and "data column indexes".
  • each thread can read the data column index, data value, and corresponding row data amount of a slice according to the starting position of the three rows of data of that slice, and write the slice in the three-row format of the sparse matrix into the memory space applied for above. Specifically, according to the starting position of the data column index of each slice, the starting position of the data value of each slice, and the starting position of the row data amount of each slice, the computing node 210 can call multiple threads to concurrently read the data value and the data column index of each slice into the second memory space, and to concurrently read the row data amount of each slice into the first memory space, thereby obtaining the file to be read and realizing the purpose of multiple threads concurrently reading multiple slices.
  • optionally, when the computing node 210 reads a file to be read whose data type is a sparse matrix, it may convert the data type of the file to be read from a sparse matrix into a dense matrix before storing it in the memory space. In the conversion process, the computing node 210 needs to know in advance the number of columns of the sparse matrix and the original row number of each data item; the original row number here refers to the row in the original data where each data item was located before the original data was converted into a sparse matrix and stored in the storage node 220. Therefore, the type information 630 may also include (12) the number of columns, and the starting position of the row data amount of each slice may include the offset value of the row data amount of the slice as well as the original row numbers.
  • in this way, each thread can read the data column index, data value, and corresponding row data amount of its slice according to the starting positions of the three rows of data of the slice, and write the slice into the memory space according to the number of rows and columns of the original data, so that multiple threads can concurrently read multiple slices of the sparse matrix and convert the sparse matrix into a dense matrix to be written into the memory space.
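  • As an illustration only, the sparse-to-dense conversion just described can be sketched as follows, assuming the three-row form (data column indexes, data values, row data amounts); the function and parameter names are hypothetical:

```python
def sparse_rows_to_dense(col_index, values, row_counts, num_cols):
    """Rebuild a dense matrix from the three-row sparse form: one row of
    data column indexes, one row of data values, and one row giving how
    many values each original row holds."""
    dense, pos = [], 0
    for count in row_counts:
        row = [0] * num_cols            # an original row, zero-filled
        for j in range(pos, pos + count):
            row[col_index[j]] = values[j]
        dense.append(row)
        pos += count
    return dense

# A 2x3 matrix holding three non-zero values:
dense = sparse_rows_to_dense([0, 2, 1], [5, 7, 9], [2, 1], 3)
assert dense == [[5, 0, 7], [0, 9, 0]]
```

In a concurrent implementation each thread would apply this conversion to its own slice's three rows, using the original row numbers carried in the metadata to place its rows in the shared memory space.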
  • the metadata formats shown in FIG. 5 and FIG. 6 are only used for illustration. The solution provided by this application is not only applicable to the above-mentioned data types (sparse matrix and dense matrix), but also applicable to other data types that can be read item by item or in batches, such as data in Libsvm format, which will not be exemplified one by one here.
  • the metadata of different data types can also include more or less content. Specifically, the content that the metadata needs to contain can be determined according to the information required by the computing node when reading the file to be read, which will not be detailed here.
  • S520: Store the metadata and the file to be read.
  • the storage node 220 stores the metadata in a designated path, or stores the metadata in the storage location of the file to be read, where the metadata of the file to be read and the file to be read contain a common identifier; for example, the file to be read and its metadata have the same file name but different extensions.
  • the storage path of the file to be read (dataA.exp) is /pathA/pathB/.../pathN/dataA.exp, where exp is the general data format of the file to be read, specifically csv, libsvm, etc.
  • the storage path of the metadata (dataA.metadata) of the file to be read is pathA/pathB/.../pathN/dataA.metadata.
  • the computing node 210 when the computing node 210 reads the file to be read, it can directly search for the metadata corresponding to the file to be read that contains the common identifier from the reading path of the file to be read.
  • the storage node 220 may also store the metadata of all files in a specified path.
  • when the computing node 210 reads the file to be read, it may search for the metadata corresponding to the file to be read from the specified path according to the common identifier.
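  • The common-identifier lookup described above (same file name, different extension, optionally under a designated metadata path) can be sketched as below; the `.metadata` extension follows the dataA.metadata example above, while the helper itself and the sample paths are hypothetical:

```python
from pathlib import PurePosixPath

def metadata_path(data_path, metadata_dir=None):
    # Same file name, ".metadata" extension; the metadata sits either next
    # to the data file or under a designated metadata path.
    data = PurePosixPath(data_path)
    base = PurePosixPath(metadata_dir) if metadata_dir else data.parent
    return base / (data.stem + ".metadata")

assert str(metadata_path("/data/dataA.csv")) == "/data/dataA.metadata"
assert str(metadata_path("/data/dataA.csv", "/meta")) == "/meta/dataA.metadata"
```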
  • the storage node 220 may also store the metadata of the file to be read inside the file to be read, with the end of the file containing the starting position of the metadata in the file. When reading the metadata, the computing node 210 can read a certain length of data directly from the end of the file to be read to determine the position of the metadata header in the file (which can be the offset value of the metadata header), and then set the read pointer to that offset value for reading, thereby obtaining the metadata of the file to be read.
  • the metadata is appended to the end of the file to be read, and the format of the file to be read containing the metadata may be as shown in FIG. 7.
•   the metadata is appended to the end of the file to be read, and (13) the check mask and (14) the metadata header offset position are also appended after the metadata.
•   the check mask (13) is generally located before "(14) metadata header offset position", and is used by the computing node 210 to confirm the starting position of (14).
•   the computing node 210 can read a certain range of content in the reverse direction from the end of the file to be read to determine whether that range contains a check mask (13) in the target format; if it does, the node can continue to read (14).
  • the offset position of the metadata header is used for the computing node 210 to determine the position of the metadata header in the file to be read.
  • the offset position of the metadata header may be Line N+1.
•   when the computing node 210 reads the file to be read, it can first set the read pointer to the end of the file, read a certain range of the file tail in a reverse manner, and perform pattern matching to determine whether that range contains a check mask in the target format. If there is no such check mask, the computing node 210 reads the file to be read using a data processing method commonly used in the industry. If the check mask in the target format is present, the node sets the read pointer to the check mask, reads forward to obtain the offset position of the metadata header, sets the read pointer to that offset position, reads the metadata, and then calls multiple threads according to the metadata to read the file to be read concurrently.
•   the check mask can be "#HWBDFORMAT", and the offset position of the metadata header can be #12345678.
•   the computing node 210 can first set the read pointer at the end of the file, then reverse-read the content within a certain range of the file tail to determine whether it contains the fixed format #HWBDFORMAT. If this check mask is present, the node reads the (14) metadata header offset position that follows it, sets the read pointer to the offset position "12345678", and starts reading the metadata.
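The tail scan just described can be sketched as below. The exact tail layout assumed here (the mask immediately followed by "#" and the offset digits) is an illustration of the format in the example above, not a normative encoding:

```python
def find_metadata_offset(path, mask=b"#HWBDFORMAT", tail_range=64):
    """Scan the tail of the file in reverse for the check mask; if found,
    read the metadata header offset that follows it (e.g. "#12345678")
    and return it as an int. Returns None when the mask is absent, in
    which case the caller falls back to an ordinary read."""
    with open(path, "rb") as f:
        f.seek(0, 2)                        # move the read pointer to the end
        size = f.tell()
        f.seek(max(0, size - tail_range))   # reverse-read a bounded tail range
        tail = f.read()
    pos = tail.find(mask)                   # pattern-match the check mask
    if pos < 0:
        return None                         # no mask: not a metadata-bearing file
    after = tail[pos + len(mask):].strip()
    return int(after.lstrip(b"#"))          # "#12345678" -> 12345678
```

A caller would then seek to the returned offset and parse the metadata from there; when `None` comes back, it uses the ordinary industry-standard read path instead.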
•   the metadata storage method can be selected according to the application environment. Storing the metadata under the same file name in the storage path of the file to be read requires no modification to the data processing logic of the computing node and is highly reusable, but it increases the file management burden on the storage node 220. Appending the metadata directly to the end of the file to be read generates no redundant files, which eases file management for the storage node 220, but the data processing logic of the computing node must be modified so that the computing node first reads the metadata from the end of the file and then reads the file to be read based on that metadata.
•   the storage mode of the metadata can be flexibly determined according to the application environment, so that the data processing method provided in this application is more widely applicable.
•   the storage node 220 parses the file to be read in advance, determines the metadata format of the file according to its data type, generates the metadata used for reading the file, and then stores it, so that when the computing node reads the file it can effectively initialize the data structure of the memory space according to the metadata and read the file to be read concurrently, improving the efficiency of file reading.
•   the metadata is highly scalable: it can be further extended and enriched with the information required for reading various types of data, which makes the solution provided by this application broadly applicable.
  • the method for the computing node 210 to read the file to be read will be explained below.
  • the data processing method provided in this application can be applied to the computing node 210 of the data processing system 400 described in FIG. 4, as shown in FIG. 8, the method includes the following steps:
•   S810: the computing node 210 obtains the metadata of the file to be read from the storage node 220, where the metadata includes the number of slices, the number of rows, and the starting position of each slice in the file to be read.
•   the storage node 220 stores the metadata in one of several ways: the metadata of the file to be read is stored in a specified path of the storage node, or the metadata is stored in the same location as the file to be read.
•   step S810 may include the following steps: the computing node 210 obtains the common identifier of the file to be read, such as its file name, from the storage node 220, and then obtains the metadata of the file to be read from the specified path or from the storage location of the file to be read according to that file name.
•   if the metadata file exists, the computing node reads it, applies for memory space based on it, creates threads, and calls the threads to concurrently read the file to be read; if the metadata file does not exist, a data processing method commonly used in the industry is used, which this application does not specifically limit.
•   the storage node 220 generates the file to be read dataA.exp and the corresponding metadata dataA.metadata, that is, the file to be read and its metadata share the same file name as a common identifier, and both are stored in /pathA/pathB/.../pathN. When the computing node 210 reads the file to be read dataA.exp, it can look in the storage path /pathA/pathB/.../pathN of dataA.exp for metadata with the same name as the file to be read, namely dataA.metadata, or check whether the metadata file exists at the path /pathA/pathB/.../pathN/dataA.metadata. If the metadata file exists, the node reads it and reads the file based on the metadata; if it does not exist, a data processing method commonly used in the industry is used, which this application does not specifically limit.
•   step S810 may include the following steps: read the end of the file to obtain the starting position of the metadata in the file to be read, which may specifically be the offset value of the metadata header, and read the metadata according to that offset value.
•   when the computing node 210 reads a file to be read in the format shown in FIG. 7, it can first set the read pointer to the end of the file, reverse-read the content within a certain range of the file tail, and perform pattern matching to determine whether that range contains the (13) check mask of the target format. If the check mask is absent, the computing node 210 reads the file to be read using a data processing method commonly used in the industry. If the (13) check mask of the target format is present, the node reads the (14) metadata header offset position after the check mask, sets the read pointer to that offset position, and then reads the metadata.
•   the computing node 210 can use a data processing method commonly used in the industry, perform data analysis on the file to be read, and return the analysis result to the storage node 220 so that the storage node 220 generates the metadata of the file to be read according to the result. In this way, when another computing node 210 reads the file to be read, the storage node 220 can return the metadata to that computing node 210, so that the computing node concurrently reads the file to be read based on the metadata.
  • S820 The computing node calls multiple threads according to the starting position of each slice in the file to be read, and concurrently reads the data of each slice, where the multiple threads are created by the computing node according to the number of slices.
  • the number of threads y may be equal to the number of slices x.
•   each thread processes a slice, and y threads can read the file to be read in parallel, achieving an optimal processing state that greatly improves the speed at which the computing node reads the file and further improves the processing efficiency of big data and AI tasks.
  • the number of threads y may be less than the number of slices x.
•   the number of slices x of the file to be read is determined according to the hardware processing capability of the computing node 210, but when the computing node 210 reads the file to be read, part of its capacity may currently be occupied by other work, for example an ongoing big data task or AI task, so the number of threads y that the computing node 210 can create may be less than the number of slices x.
•   the computing node 210 can directly create 10 threads and call them to read the slices of the file to be read in parallel, achieving the optimal processing state in which the computing node reads the file fastest and processing efficiency is highest. If 3 cores of the computing node 210 are currently processing big data tasks and only 7 cores are idle, the computing node 210 can create 7 threads G1 to G7 and call them to concurrently read the 10 slices of the file to be read. It should be understood that the above examples are only for illustration, and this application does not make specific limitations.
•   S830: the computing node stores the data of each slice in the memory space according to the order of the starting position of each slice in the file to be read, where the memory space is applied for by the computing node according to the number of rows.
•   the starting position of each slice in the file to be read can be expressed as the offset value and line number of the slice's start in the file. Therefore, after each thread reads the data of its slice, multiple threads can be called to write the slices into the memory space concurrently, in order of the offset values or line numbers of the slice starting positions.
•   one thread can process one slice first; then, after each thread finishes reading its slice, it continues to take the next slice from the remaining slices until all slices have been read.
•   for example, the computing node 210 creates 7 threads G1 to G7 to read the file to be read, and the file has 10 slices. Threads G1 to G7 can first concurrently read slices 1 to 7; after thread G1 finishes slice 1, it takes a slice from the remainder, say slice 8, and continues processing, and the other threads follow the same strategy until all slices are processed. It should be understood that the above examples are only for illustration, and this application does not make specific limitations.
•   the starting position of a slice can include the offset value and line number of the slice's start in the file to be read. Each thread can determine the length of the slice to be read from the line number of that slice's start and the line number of the next slice's start, so some threads can read multiple consecutive slices from the starting position of the current slice according to the lengths of the current slice and the following slice.
•   for example, if the number of threads is 7 and the number of slices is 10, 4 slices can be allocated to threads 1 to 4 for concurrent reading (one slice each), and 6 slices to threads 5 to 7 (two slices each): thread 5 reads from the start of the 5th slice to the start of the 7th slice, thread 6 reads from the start of the 7th slice to the start of the 9th slice, and thread 7 reads from the start of the 9th slice to the end of the file.
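One way to realize the 7-thread/10-slice split above (the first threads take one slice, the later threads take two) is a contiguous assignment. This is a sketch of one possible policy; it assumes the slice count is at most twice the thread count, and the function name is hypothetical:

```python
def assign_slices(num_slices: int, num_threads: int):
    """Assign contiguous ranges of slice indexes (0-based) to threads when
    the thread count is below the slice count. Earlier threads take one
    slice; later threads take two, so every slice is read exactly once.
    Assumes num_threads <= num_slices <= 2 * num_threads."""
    doubles = num_slices - num_threads      # threads that must take 2 slices
    singles = num_threads - doubles         # threads that take 1 slice
    plan, start = [], 0
    for t in range(num_threads):
        count = 1 if t < singles else 2
        plan.append(list(range(start, start + count)))
        start += count
    return plan
```

`assign_slices(10, 7)` gives the last three threads the slice pairs [4, 5], [6, 7], and [8, 9] (0-based), matching the reading ranges of threads 5 to 7 described above.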
•   each row of data is denoted L1 to L9 respectively.
•   the computing node 210 can apply for 3 threads G1 to G3 according to the slice count 3, apply for a memory space n0 that accommodates 9 rows of data from the memory according to the row count 9, and then call the 3 threads to concurrently read the file to be read into the memory space n0.
  • thread G1 reads slice 1
  • thread G2 reads slice 2
  • thread G3 reads slice 3.
•   thread G1 determines that the length of slice 1 is 3 lines according to the line number 1 of slice 1 and the line number 4 of the next slice (slice 2); thread G2 determines that the length of slice 2 is 3 lines according to the line number 4 of slice 2 and the line number 7 of the next slice (slice 3); thread G3 determines that the length of slice 3 is 3 lines according to the line number 7 of slice 3 and the total row count 9.
•   thread G1 sets the read pointer to the offset value w1 and reads 3 lines of data L1 to L3 into the first three lines of the memory space n0; thread G2 sets the read pointer to the offset value w4 and reads 3 lines of data L4 to L6 into lines 4 to 6 of the memory space n0; thread G3 sets the read pointer to the offset value w7 and reads 3 lines of data L7 to L9 into the last three lines of the memory space n0. Threads G1, G2, and G3 process these tasks concurrently, thereby completing one concurrent file read.
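The 9-row/3-slice example above can be sketched with a thread pool. The function name, the plain-text row format, and the use of `ThreadPoolExecutor` are assumptions for illustration; the structure (per-slice byte offset plus start line number, a buffer sized once from the row count, rows written at their final positions) follows the text:

```python
from concurrent.futures import ThreadPoolExecutor

def read_slices_concurrently(path, offsets, start_lines, total_rows):
    """Read a total_rows-line text file into a preallocated buffer, one
    thread per slice. offsets[i] is the byte offset of slice i's first
    line; start_lines[i] is its 1-based line number. Each thread derives
    its slice length from the next slice's start line (or the total row
    count), so no later reordering or buffer expansion is needed."""
    buffer = [None] * total_rows                 # applied once from the row count
    bounds = list(start_lines) + [total_rows + 1]

    def read_slice(i):
        n_lines = bounds[i + 1] - bounds[i]      # slice length in lines
        with open(path, "r") as f:               # per-thread file handle
            f.seek(offsets[i])                   # jump to the slice start
            for k in range(n_lines):
                buffer[bounds[i] - 1 + k] = f.readline().rstrip("\n")

    with ThreadPoolExecutor(max_workers=len(offsets)) as pool:
        list(pool.map(read_slice, range(len(offsets))))
    return buffer
```

Because each thread writes disjoint positions of the shared buffer, no locking is needed, and the result arrives already in file order.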
•   the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file, so that when the computing node 210 reads the file to be read from the storage node 220 it can call multiple threads to read the file concurrently. This not only avoids failed data processing caused by incorrectly initializing the memory-space data structure when the data type cannot be determined, but also avoids the resource waste caused by repeatedly expanding the memory space when the number of lines of the file cannot be determined. Reading the file concurrently greatly improves the speed at which the computing node 210 reads files and further improves the processing efficiency of big data and AI tasks.
  • the above steps S810 to S830 are the general data reading method provided by this application.
•   the metadata format of the file to be read differs for different data types, so the data reading process differs in detail across application scenarios.
•   the following describes in detail, with reference to a specific application scenario, the process by which the aforementioned computing node 210 reads the file to be read according to the metadata, taking as an example the case where the storage node 220 stores the file to be read and its corresponding metadata under the same file name in the same path, the data type of the file is a dense matrix, and the metadata format is as shown in FIG. 5.
  • the process for the computing node 210 to obtain the metadata of the file to be read from the storage node 220 may be as follows:
•   step S1002: search, according to the common identifier, whether metadata corresponding to the file to be read exists in the same path or the designated path; if it exists, execute step S1003, and if it does not exist, execute step S1011. Assuming that the metadata extension is .metadata, the node can search the same path for /pathA/pathB/pathC/.../pathN/dataA.metadata to determine whether the metadata dataA.metadata of the file to be read dataA.exp exists.
•   step S1004: obtain (4) the check mask of the metadata file and verify it. If the check mask is verified successfully, the position is the head of the metadata file, and the node starts to read the metadata file, that is, executes step S1005. If verification of the check mask fails, the position is not the head of the metadata file; the computing node 210 can stop reading the metadata and read the file to be read by other means, that is, execute step S1011.
•   step S1005: obtain (5) the metadata check value and verify it. If the metadata check value is verified successfully, the metadata has not been changed since being stored in the storage node 220; the computing node 210 can read the file to be read according to the content of the metadata, and continues to step S1006. If verification of the metadata check value fails, the metadata may have been changed due to data loss or other reasons; the computing node 210 can stop reading the metadata and execute step S1011.
•   the metadata check value can be generated according to certain rules from information such as the data length when the metadata is stored. When the computing node 210 reads the metadata, it can generate a check value from the current metadata's data length and other information according to the same rules for verification. If this check value equals (5) the metadata check value, the metadata has not changed, and step S1006 can continue; if not, the metadata may have changed due to data loss or other reasons. It should be understood that the above implementation of (5) the metadata check value is only for illustration, and this application does not specifically limit the verification method of the metadata.
•   step S1006: obtain (6) the file check value and verify it. If the file check value is verified successfully, the file to be read has not been changed since being stored, and the node continues to step S1007. If verification of the file check value fails, the file to be read may have been changed after storage due to data loss or other reasons, and the computing node 210 may stop reading the file to be read and return a message that the read has failed, that is, execute step S1012.
•   the computing node 210 may first determine whether the file check value is valid, to handle the case where some storage nodes 220 do not generate a file check value and (6) the file check value field is a meaningless character string. If the file check value is invalid, step S1007 can be executed directly. If the check value is valid, it can be verified: if verification of the file check value succeeds, continue to step S1007; if it fails, the computing node 210 may stop reading the file to be read and return information that the read has failed, that is, execute step S1012.
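Steps S1005 and S1006 leave the checksum rule open ("generated according to certain rules" from the data length and other information). A CRC32 over the stored bytes is one possible stand-in, used here purely for illustration; the function name and the choice of CRC32 are assumptions:

```python
import zlib

def verify_check_value(stored_bytes: bytes, stored_check: int) -> bool:
    """Recompute a check value over the stored bytes with the same rule
    used at write time and compare it with the stored check value; a
    match means the content has not changed since it was stored. CRC32
    is an assumed rule, not one fixed by the text above."""
    return zlib.crc32(stored_bytes) == stored_check
```

The same comparison covers both (5) the metadata check value and (6) the file check value, since each is a recomputed value checked against a stored one.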
•   step S1007: obtain (7) the metadata format version, (8) the file format version, and (9) the data type; for example, the metadata format version is V1, the file format is CSV, and the data type is dense matrix. Determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V1, whose file format is CSV, and whose data type is dense matrix. If it is supported, the computing node 210 may execute step S1008; if it is not supported, step S1011 may be executed.
•   S1008: apply for the memory space for loading the file to be read according to (1) the number of rows, and initialize the data structure of the memory space according to (10) the feature value type.
•   S1009: the computing node 210 obtains (2) the number of slices as x, and creates y threads according to the number of cores the processor currently has and its processing capability, where y is less than or equal to x.
•   thread 1 can read slice 1, thread 2 can read slice 2, and so on, so that multiple threads read multiple slices in parallel, which greatly improves the reading speed of the file to be read and thus the processing efficiency of the entire big data or AI task.
•   if the number of threads is less than the number of slices, for example 8 threads and 16 slices, each thread first processes one slice; after a thread finishes its current slice, it takes another slice from the remainder and continues. For example, after thread 1 finishes slice 1 and slice 9 is still pending, thread 1 can continue with slice 9, and the other threads execute the same strategy until all slices are processed. Specifically, the above process can be implemented through round-robin scheduling, which is not detailed here.
•   alternatively, all slices can be allocated to the threads directly. Taking the same example, with 8 threads and 16 slices whose lengths are l1 to l16, thread 1 is allocated slices 1 and 2: it reads data of length l1 + l2 from the starting position of slice 1, reading slice 1 and slice 2 into the memory space; thread 2 directly reads data of length l3 + l4 from the starting position of slice 3, reading slice 3 and slice 4 into the memory space; and so on, which this application does not specifically limit.
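The direct allocation just described (thread 1 reading l1 + l2 bytes from slice 1's start) amounts to a single seek plus one read of the summed slice lengths. A minimal sketch, with a hypothetical helper name:

```python
def read_combined_slices(f, start_offset: int, slice_lengths) -> bytes:
    """Read several consecutive slices in a single call: seek to the
    first assigned slice's starting position and read the sum of the
    assigned slice lengths, e.g. l1 + l2 bytes from the start of
    slice 1. f is any seekable binary file object."""
    f.seek(start_offset)
    return f.read(sum(slice_lengths))
```

Combining consecutive slices this way trades finer scheduling granularity for fewer seeks and read calls per thread.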
•   S1011: the computing node 210 uses other methods to read the file to be read, such as other data processing methods commonly used in the industry, which are not specifically limited here.
  • S1012 The computing node 210 stops reading the file to be read, and returns information that there is an error in the data of the file to be read and the reading has failed.
•   the foregoing data processing method stores the metadata of the file to be read in the storage node 220 in advance, so that when the computing node 210 reads the file to be read from the storage node 220, the memory space can be effectively initialized according to the metadata and the file can be read concurrently, improving the efficiency of data reading and thus the processing efficiency of entire AI tasks and big data tasks.
•   more information can be added to the metadata to meet functional requirements such as data security and reliability, and the metadata is highly scalable.
•   the following describes in detail the process by which the aforementioned computing node 210 reads the file to be read according to the metadata, taking as an example the case where the storage node 220 stores the metadata at the end of the file to be read in the manner shown in FIG. 7, the data type of the file is a sparse matrix, and the metadata format is as shown in FIG. 6.
  • the process for the computing node 210 to obtain the metadata of the file to be read from the storage node 220 may be as follows:
•   step S1103: reverse-read the content within a certain range of the file tail and determine whether a matching format (that is, the format of the (13) check mask) exists within that range. If it exists, the position is the (13) check mask of the metadata, and step S1104 can be executed. If it does not exist, no metadata has been added to the file, and the computing node 210 can use a general data processing method, that is, execute step S1112.
•   step S1105: obtain (4) the check mask in the metadata and verify it a second time, to further confirm whether the position is the head of the metadata. If the check mask is verified successfully, step S1106 is executed; if verification fails, step S1112 is executed. For details, please refer to the aforementioned step S1004, which is not repeated here.
•   step S1106: obtain (5) the metadata check value and verify it. If the metadata check value is verified successfully, proceed to step S1107; if verification fails, execute step S1112.
•   step S1107: obtain (6) the file check value and verify it. If the file check value is verified successfully, the file to be read has not been changed since being stored, and the node continues to step S1108. If verification fails, the file to be read may have been changed after storage due to data loss or other reasons, and the computing node 210 may stop reading the file to be read and execute step S1113. For details, please refer to the aforementioned step S1012, which is not repeated here.
•   S1108: obtain (7) the metadata format version, (8) the file format version, and (9) the data type; for example, the metadata format version is V2, the file format is CSV, and the data type is sparse matrix. Determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V2, whose file format is CSV, and whose data type is sparse matrix. If it is supported, the computing node 210 can execute step S1109; if it is not supported, step S1112 is executed.
•   S1109: apply for the memory space for storing data values and data column indexes according to (10) the number of values, and apply for the memory space for storing the row data amount according to (1) the number of rows.
•   S1110: the computing node 210 obtains (2) the number of slices as x, and then creates y threads according to the number of cores the processor currently has and its processing capability, where y is less than or equal to x.
•   step S1111: each thread concurrently reads multiple slices of the file to be read into the memory space. For details, please refer to step S1010 of the foregoing content, which is not repeated here.
•   for a file to be read whose data type is a sparse matrix, when the computing node 210 calls multiple threads to read the file concurrently, it can, according to the starting position of each slice's data column index, the starting position of each slice's data values, and the starting position of each slice's row data amount, call multiple threads to concurrently read the data values and data column index of each slice into the first memory space, and call multiple threads to concurrently read the row data amount of each slice into the second memory space, thereby obtaining the file to be read.
•   in some embodiments, the computing node 210 needs to convert the sparse matrix into a dense matrix before loading it into the memory space. Each thread can therefore convert its data into the dense-matrix form according to information in the metadata such as (1) the number of rows, (12) the number of columns, and (10) the number of values, and then write it into the memory space. For details, please refer to the embodiment in FIG. 6, which is not repeated here.
•   without metadata, the computing node 210 needs to read the entire file to be read, first parse out its numbers of rows, columns, and values, and only then convert the sparse matrix into a dense matrix. With the metadata, multiple threads can directly convert the slices into the dense-matrix format and write them into the memory space while reading the slices concurrently, according to the numbers of rows, columns, and values in the metadata, thereby avoiding the process of converting the whole sparse matrix into a dense matrix after all of it has been read, and improving the reading efficiency of files of the sparse matrix data type.
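Converting a slice straight to dense rows as it is read can be sketched as below, assuming a CSR-like per-slice encoding (data values, data column indexes, and a per-row data amount). The encoding details and function name are assumptions for illustration:

```python
def slice_to_dense(values, col_index, row_counts, num_cols):
    """Expand one slice of a sparse file into dense rows. row_counts[r]
    is the row data amount (number of stored values) for row r of the
    slice; col_index[k] gives the column of values[k]. The dense width
    comes from (12) the number of columns in the metadata, so a thread
    can emit its rows without first reading the whole matrix."""
    dense, k = [], 0
    for count in row_counts:
        row = [0] * num_cols          # dense row, width from the column count
        for _ in range(count):
            row[col_index[k]] = values[k]
            k += 1
        dense.append(row)
    return dense
```

Each thread applies this to its own slice and writes the resulting rows at their final positions in the memory space, which is what removes the separate whole-matrix conversion pass.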
•   S1112: the computing node 210 uses other methods to read the file to be read, such as other data processing methods commonly used in the industry, which are not specifically limited here.
  • S1113 The computing node 210 stops reading the file to be read, and returns information that the data of the file to be read has an error and the reading has failed.
•   the foregoing data processing method stores the metadata of the file to be read in the storage node 220 in advance, so that when the computing node 210 reads the file to be read from the storage node 220, the memory space can be effectively initialized according to the metadata, and the memory space for storing the file to be read can be applied for in one pass based on the metadata, avoiding the resource waste caused by repeatedly expanding the memory space. The file can also be read concurrently based on the metadata, improving the efficiency of data reading and thus the processing efficiency of entire AI tasks and big data tasks. Moreover, a file to be read whose data type is a sparse matrix can be converted directly into a dense matrix as it is loaded into memory, improving the reading efficiency of sparse matrices, and more information can be appended to the metadata to adapt to reading more types of data files, which makes the data processing method very widely applicable.
  • FIG. 12 is a schematic structural diagram of a computing node 210 provided by the present application.
  • the computing node 210 is applied to the data processing system 400 shown in FIG. 3, and the computing node 210 includes:
•   the metadata reading unit 211 is configured to obtain the metadata of the file to be read, where the metadata includes the number of slices, the number of rows, and the starting position of each slice in the file to be read;
  • the slice reading unit 212 is configured to call multiple threads according to the starting position of each slice in the file to be read, and concurrently read the data of each slice, wherein the multiple threads are created by the computing node according to the number of slices ;
  • the slice reading unit 212 is further configured to store the data of each slice in the memory space according to the order of the starting position of each slice in the file to be read, where the memory space is obtained by the computing node according to the number of rows.
•   the metadata of the file to be read is generated by the storage node according to the metadata format and the file to be read, after the storage node determines the metadata format according to the data type of the file to be read, where files of different data types have different metadata formats.
•   the metadata of the file to be read is stored in the file to be read, and the end of the file includes the starting position of the metadata within the file. The metadata reading unit 211 is configured to read from the end of the file to be read to obtain the starting position of the metadata in the file, and to read the metadata of the file to be read according to that starting position.
  • the metadata of the file to be read is stored in a designated path of the storage node.
  • the metadata storage location of the file to be read is the same as the storage location of the file to be read.
•   the file to be read and its metadata include a common identifier. The metadata reading unit 211 is configured to obtain the common identifier of the file to be read from the storage node, and to obtain the metadata of the file to be read from the specified path or from the storage location of the file to be read according to that common identifier.
  • the metadata of the file to be read includes verification information.
  • the verification information is used to verify whether the metadata of the file to be read has changed after being stored in the storage node.
  • the slice reading unit 212 is configured to verify, according to the verification information, whether the metadata of the file to be read has changed after being stored in the storage node, before calling multiple threads according to the starting position of each slice in the file to be read and concurrently reading the data of each slice.
  • the slice reading unit 212 is configured, when the metadata of the file to be read has not changed after being stored in the storage node, to call multiple threads according to the starting position of each slice in the file to be read and concurrently read the data of each slice.
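A compact sketch of this verify-then-read gate, using a SHA-256 digest as the verification information (one of the check methods the description mentions); the digest is assumed to have been recorded when the metadata was stored:

```python
import hashlib
import json

def verify_and_load(meta_bytes, stored_digest):
    # stored_digest was recorded when the metadata was written; if the
    # metadata changed after being stored, the digests no longer match.
    if hashlib.sha256(meta_bytes).hexdigest() != stored_digest:
        return None                    # changed: fall back to a generic reader
    return json.loads(meta_bytes)      # unchanged: safe to drive concurrent reads
```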
  • the metadata of the file to be read also includes a data type.
  • the metadata also includes a feature value type.
  • the feature value type is used for the computing node to initialize the data of the memory space.
  • the slice reading unit 212 is used to initialize the data structure of the memory space according to the data type before calling multiple threads according to the starting position of each slice in the file to be read, and reading the data of each slice concurrently.
  • the file to be read includes the data value, data column index, and row data amount.
  • the metadata also includes the number of values, and the number of values is used to apply for the memory space that stores the data values and data column indexes.
  • the slice reading unit 212 is configured to, before calling multiple threads according to the starting position of each slice in the file to be read and concurrently reading each slice, apply for a first memory space for storing the data values and the data column indexes according to the number of values; the slice reading unit is further configured to apply for a second memory space for storing the row data amounts according to the number of rows, and to obtain the memory space for storing the file to be read from the first memory space and the second memory space.
  • the starting position of each slice in the file to be read includes the starting position of the data column index of each slice, the starting position of the data value of each slice, and the starting position of the row data amount of each slice; the slice reading unit 212 is configured to store the data of each slice in the memory space according to the order of the starting position of each slice in the file to be read, by storing the data column index and data value of each slice in the first memory space in the order of their starting positions, and storing the row data amount of each slice in the second memory space in the order of its starting position.
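The allocation scheme described above (a first region, sized by the number of values, for data values plus column indexes; a second region, sized by the number of rows, for row data amounts) can be sketched as follows. The element types chosen here are stand-in assumptions for whatever the metadata's data type actually specifies:

```python
from array import array

def allocate_csr_buffers(num_values, num_rows):
    # First memory space, sized by the number of values: the data values
    # and their data column indexes (8 bytes per element, zero-initialized).
    values = array("d", bytes(8 * num_values))
    col_index = array("q", bytes(8 * num_values))
    # Second memory space, sized by the number of rows: the row data amounts.
    row_counts = array("q", bytes(8 * num_rows))
    return values, col_index, row_counts
```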
  • the computing node 210 in the embodiment of the present application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the above PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • the computing node 210 may correspondingly execute the methods described in the embodiments of the present application, and the foregoing and other operations and/or functions of each unit in the computing node 210 are respectively for implementing the corresponding processes of the methods in FIG. 1 to FIG. 11; for the sake of brevity, they will not be repeated here.
  • in this application, when a computing node performs data reading, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file, so that the computing node 210 can obtain the metadata from the storage node.
  • the length of the file to be read, the number of slices, and the starting position of each slice in the file to be read can then be determined from that metadata, so that the memory space is applied for in one step and multiple threads read the file concurrently; this avoids both the failure of data processing caused by incorrectly initializing the memory space's data structure when the data type cannot be determined, and the waste of resources caused by repeatedly expanding the memory space when the number of lines in the file to be read cannot be determined. Concurrent reading also greatly improves the speed at which the computing node 210 reads files, further improving the processing efficiency of big data and AI tasks.
  • FIG. 13 is a schematic structural diagram of a server 1300 provided by an embodiment of this application.
  • the server 1300 may be the computing node 210 and the storage node 220 in the embodiment of FIG. 1 to FIG. 11.
  • the server 1300 includes a processor 1310, a communication interface 1320, and a memory 1330.
  • the processor 1310, the communication interface 1320, and the memory 1330 may be connected to each other through an internal bus 1340, and may also communicate through other means such as wireless transmission.
  • the embodiment of the present application takes the connection via the bus 1340 as an example.
  • the bus 1340 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus.
  • the bus 1340 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 13, but it does not mean that there is only one bus or one type of bus.
  • the processor 1310 may be constituted by at least one general-purpose processor, such as a CPU, or a combination of a CPU and a hardware chip.
  • the above-mentioned hardware chip may be ASIC, PLD or a combination thereof.
  • the above-mentioned PLD can be CPLD, FPGA, GAL or any combination thereof.
  • the processor 1310 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1330, which enables the computing node 210 to provide various services.
  • the processor 1310 may be a multi-core processor shown in FIG. 1 or a multi-CPU multi-core processor, which is not specifically limited in this application.
  • the memory 1330 is used to store program codes, whose execution is controlled by the processor 1310, so as to execute the processing steps of the computing node 210 in any of the embodiments in FIG. 1 to FIG. 11 described above.
  • the program code may include one or more software modules, and the one or more software modules may be software units of the computing node 210 provided in the embodiment of FIG. 1, such as a metadata reading unit, a slice reading unit, etc.
  • the metadata reading unit is used to obtain the metadata of the file to be read from the storage node; the slice reading unit is used to create multiple threads according to the number of slices and the processing capacity of the computing node's processor, and to apply for the memory space for storing the file to be read; the slice reading unit is also used to call multiple threads according to the starting position of each slice in the file to be read and concurrently read each slice into the memory space to obtain the file to be read. Specifically, these units can be used to execute steps S810 to S830 and their optional steps in the embodiments of FIG. 8 and FIG. 9, steps S1001 to S1012 and their optional steps in the embodiment of FIG. 10, and steps S1101 to S1113 and their optional steps in the embodiment of FIG. 11, and can also be used to perform the other steps performed by the computing node 210 described in the embodiments of FIG. 1 to FIG. 11, which will not be described in detail here.
  • the memory 1330 is used to store program codes, whose execution is controlled by the processor 1310, so as to execute the processing steps of the storage node 220 in any of the embodiments in FIG. 1 to FIG. 11 described above.
  • the program code may include one or more software modules.
  • the one or more software modules may be software units of the storage node 220 provided in the foregoing embodiments, with which the storage node 220 obtains the metadata of the file to be read according to the file to be read.
  • the metadata of the file to be read includes the number of slices, the number of rows, and the starting position of each slice in the file to be read. Specifically, the modules can be used to perform steps S510 to S520 and their optional steps in the embodiment of FIG. 5, and can also be used to perform the other steps performed by the storage node 220 described in the embodiments of FIG. 1 to FIG. 11.
  • the memory 1330 may include a volatile memory, such as a random access memory (RAM); the memory 1330 may also include a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1330 may also include a combination of the above types.
  • the memory also stores program code.
  • when the server 1300 is the computing node 210, the memory may specifically include program code for executing the steps performed by the computing node described in the embodiments of FIG. 1 to FIG. 11.
  • when the server 1300 is the storage node 220, it may specifically include program code for executing the steps performed by the storage node described in the embodiments of FIG. 1 to FIG. 11, and may also store the file to be read and the metadata of the file to be read.
  • the communication interface 1320 may be a wired interface (such as an Ethernet interface), an internal interface (such as a high-speed serial computer expansion bus (peripheral component interconnect express, PCIe) bus interface), or a wireless interface (such as a cellular network interface or a wireless local area network interface), used to communicate with other devices or modules.
  • the server in this embodiment may be a common physical server, for example an ARM server or an x86 server, or it may be a virtual machine implemented on a common physical server in combination with NFV technology, where a virtual machine is a complete, software-simulated computer system with full hardware functions that runs in a completely isolated environment, for example implemented on a cloud computing infrastructure.
  • FIG. 13 is only a possible implementation of the embodiment of the present application.
  • the server 1300 may also include more or fewer components, which is not limited here.
  • Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
  • the server shown in FIG. 13 may also be a computer cluster composed of at least one physical server, which is not specifically limited in this application.
  • FIG. 14 is a storage array 1400 provided by the present application.
  • the storage array 1400 may be the storage node 220 of the foregoing content.
  • the storage array 1400 includes a storage controller 1410 and at least one storage 1420, where the storage controller 1410 and the at least one storage 1420 are connected to each other through a bus 1430.
  • the storage controller 1410 includes one or more general-purpose processors, where a general-purpose processor can be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on.
  • the storage controller 1410 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1420, which enables the storage array 1400 to provide multiple services.
  • the memory 1420 is used to store program codes, whose execution is controlled by the storage controller 1410, so as to execute the processing steps of the storage node 220 in any one of the embodiments in FIG. 1 to FIG. 11 described above.
  • the program code may include one or more software modules.
  • the one or more software modules may be software units of the storage node 220 provided in the foregoing embodiments, with which the storage node 220 obtains the metadata of the file to be read according to the file to be read.
  • the metadata of the file to be read includes the number of slices, the number of rows, and the starting position of each slice in the file to be read. Specifically, the modules can be used to perform steps S510 to S520 and their optional steps in the embodiment of FIG. 5.
  • the memory 1420 is also used to store program data.
  • the program data includes the file to be read and the metadata of the file to be read.
  • FIG. 14 takes the case in which the program code is stored in memory 1 and the program data is stored in memory n as an example for illustration, which is not limited in this application.
  • the memory 1420 may be a non-volatile memory, such as ROM, flash memory, HDD, or SSD memory, and may also include a combination of the foregoing types of memory.
  • the storage array 1400 may be composed of multiple HDDs or multiple SSDs, or the storage array 1400 may be composed of multiple HDDs and ROMs.
  • at least one memory 1420 is combined in different ways with the assistance of the storage controller 1410 to form a memory group, thereby providing higher storage performance than a single memory as well as data backup capabilities.
  • the storage array 1400 shown in FIG. 14 may also be one or more data centers composed of at least one storage array, and the above one or more data centers may be located at the same location or at different locations, which is not specifically limited in this application.
  • FIG. 14 is only a possible implementation of the embodiment of the present application.
  • the storage array 1400 may also include more or fewer components, which is not limited here.
  • Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
  • This application also provides a system including the server 1300 described in FIG. 13 and the storage array 1400 described in FIG. 14; for the sake of brevity, details are not repeated here.
  • the embodiment of the present application also provides a computer-readable storage medium that stores instructions which, when run on a processor, implement the method flows shown in FIG. 1 to FIG. 11.
  • the embodiment of the present application also provides a computer program product.
  • when the computer program product runs on a processor, the method flows shown in FIG. 1 to FIG. 11 can be realized.
  • the foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes at least one computer instruction.
  • when the computer program instructions are loaded or executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • Computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (such as by infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage node, such as a server or a data center, that integrates at least one available medium.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, and a magnetic tape), an optical medium (for example, a high-density digital video disc (Digital Video Disc, DVD)), or a semiconductor medium.
  • the semiconductor medium may be an SSD.


Abstract

A data processing method and system. The data processing method is applied to the data processing system, and the data processing system comprises a computing node and a storage node. The data processing method comprises the following steps: a computing node acquires metadata of a file to be read (S810); then, according to the starting position of each slice in said file, calls multiple threads and concurrently reads the data of each slice (S820); and finally, according to the order of the starting positions of the slices in said file, stores the data of each slice in a memory space (S830). By means of the method, when a computing node reads a file to be read, a memory space capable of accommodating said file can be applied for in one step according to the metadata of said file, and said file can be read concurrently to improve the efficiency of data reading, thereby improving the efficiency of processing the whole AI or big data task.

Description

Data processing method and system

Technical Field
This application relates to the field of computers, and in particular to a data processing method and system.
Background
With the continuous development of science and technology, the massive amounts of data generated in the era of information explosion have penetrated into every industry and business function, and the fields of big data and artificial intelligence (AI) have developed alongside them, becoming two very popular research directions.
When a computing node performs a big data or AI task, it first needs to load data files from other devices or platforms into the computing node's memory, after which the computing node completes the computation of the big data or AI task based on the data in memory. However, because the amount of data is large and files cannot be read concurrently, the computing node reads files very inefficiently, and the time it takes the computing node to load a data file into memory can even exceed the time it takes to complete the big data or AI task based on that data, seriously affecting the efficiency of big data or AI tasks.
Summary of the Invention
This application provides a data processing method and system that can improve the efficiency with which computing nodes read files.
In a first aspect, a data processing method is provided, applied to a data processing system that includes a computing node and a storage node. The data processing method includes the following steps: the computing node obtains metadata of a file to be read, where the metadata includes the number of rows of the file to be read and the starting position of each slice in the file to be read; then, according to the starting position of each slice in the file to be read recorded in the metadata, the computing node concurrently reads the data of each slice; finally, according to the order of the starting positions of the slices in the file to be read, it stores the data of each slice into a memory space, where the memory space is applied for according to the number of rows in the metadata.
Because the storage node generates the metadata of the file to be read in advance, when the computing node reads the file it can obtain the number of rows of the file and the starting position of each slice in the file from the metadata, thereby applying for the memory space in one step and having multiple threads read the file concurrently. This avoids the waste of resources caused by repeatedly expanding the memory space when the number of rows of the file cannot be determined, and concurrent reading greatly improves the speed at which the computing node reads files, further improving the processing efficiency of big data and AI tasks.
In a possible implementation, the metadata of the file to be read may also include the number of slices; before the computing node concurrently reads the data of each slice according to the starting position of each slice in the file to be read, it can create multiple threads according to the number of slices and then call those threads to read the data of each slice concurrently. Simply put, when the storage node generates the metadata, it can determine the number of slices x according to the computing node's hardware processing capacity; when the computing node reads the metadata, it creates y threads according to that number of slices x and its current processing capacity, and calls the y threads to read the x slices concurrently.
Optionally, the number of threads y may equal the number of slices x. In that case each thread processes one slice, and the y threads can read the file to be read in parallel, reaching an optimal processing state; the speed at which the computing node reads the file is greatly improved, further improving the processing efficiency of big data and AI tasks.
Optionally, the number of threads y may be less than the number of slices x. When fewer threads are created than there are slices, each thread can first process one slice and then, after finishing it, continue to read the next slice from the remaining ones until all slices have been read. Alternatively, some threads may process only one slice while others process multiple slices; a thread that needs to process p slices can read directly from the starting position of its current slice to the starting position of the (p+1)-th slice. In this way, one thread can process multiple slices, so the slices of the file to be read can still be read concurrently when the number of threads is less than the number of slices.
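The y < x case described above is a classic work-queue pattern: each thread reads one slice, then pulls the next remaining slice until none are left. A minimal sketch follows, where read_slice is a hypothetical callable that reads a single slice (in a real reader it would seek to the slice's starting position):

```python
import queue
import threading

def read_slices_with_fewer_threads(read_slice, slice_ids, num_threads):
    # Every slice goes into a shared queue; each of the y threads pulls a
    # slice, reads it, then pulls the next remaining one until none are left.
    todo = queue.Queue()
    for s in slice_ids:
        todo.put(s)

    def worker():
        while True:
            try:
                s = todo.get_nowait()
            except queue.Empty:
                return                    # all slices have been read
            read_slice(s)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```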
The computing node can flexibly choose the number of threads to create according to its current processing capacity. If the number of threads the processor can currently create equals the number of slices, it can call multiple threads to read the multiple slices of the file in parallel, each thread processing exactly one slice, achieving the best processing state and greatly improving the efficiency with which the computing node reads the file. If the number of threads the processor can currently create is lower than the number of slices, the slices can still be read concurrently, with one thread processing multiple slices; this avoids the possibility of a concurrent read failing because the computing node is currently heavily loaded and its processing capacity is reduced, and a reduced thread count does not prevent concurrent file reading, ensuring the feasibility of the solution. In a possible implementation, the metadata of the file to be read is generated by the storage node, according to the metadata format and the file to be read, after the storage node determines the metadata format from the file's data type, where files of different data types have different metadata formats.
The storage node parses the file to be read in advance, determines the metadata format of the file according to its data type, generates the metadata used for reading the file, and then stores that metadata. As a result, when the computing node reads the file, it can effectively initialize the memory data structure according to the file's metadata and read the file concurrently, improving reading efficiency. Moreover, the metadata is highly extensible: it can be further extended and enriched with whatever information the various data types need at read time, giving the solution provided by this application very broad applicability.
In another possible implementation, the metadata of the file to be read is stored in the file itself, and the end of the file includes the starting position of the metadata within the file. Thus, when the computing node obtains the metadata of the file to be read from the storage node, it can obtain the starting position of the metadata from the end of the file and then read the metadata according to that starting position.
Optionally, the metadata of the file to be read may be stored at the tail of the file, with a metadata header offset and a check mask written at the very end of the file, where the check mask is located before the metadata header offset. When the computing node reads the metadata, it can set the read pointer to the end of the file, read a certain range of content backwards, and determine whether a check mask exists within that range; if it does, the node sets the pointer at the check mask, reads the metadata header offset forwards, then sets the read pointer to that metadata header offset and reads forwards to obtain the metadata.
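The tail-scan procedure in this paragraph can be sketched as follows. The mask value and the layout (check mask immediately followed by an 8-byte little-endian metadata header offset) are assumptions for illustration; the real format is whatever the storage node defines:

```python
import struct

CHECK_MASK = b"METAMASK"    # hypothetical mask value; the real one is format-defined

def locate_metadata(path, scan_window=256):
    with open(path, "rb") as f:
        f.seek(0, 2)                     # 2 = os.SEEK_END: point at the file tail
        size = f.tell()
        start = max(0, size - scan_window)
        f.seek(start)
        tail = f.read()                  # a range of content near the end
        pos = tail.rfind(CHECK_MASK)     # search backwards for the check mask
        if pos == -1:
            return None                  # no mask found: not this file format
        off_at = pos + len(CHECK_MASK)
        (meta_start,) = struct.unpack("<Q", tail[off_at:off_at + 8])
        f.seek(meta_start)               # jump to the metadata header offset
        # The metadata region ends where the check mask begins.
        return f.read(start + pos - meta_start)
```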
Storing the metadata of the file to be read inside the file itself lets the computing node obtain the metadata's starting position from the end of the file and then read the metadata, without the storage node having to set aside extra resources to store the metadata separately; this simplifies the storage node's file management and reduces its management burden.
In another possible implementation, the metadata of the file to be read is stored under a designated path on the storage node.
Optionally, the metadata of the file to be read is stored in the same location as the file itself.
In a specific implementation, the file to be read and its metadata share a common identification, and the computing node obtaining the metadata from the storage node includes: the computing node obtains the common identification of the file to be read from the storage node; the computing node then obtains the metadata of the file from the designated path or from the file's storage location according to that common identification.
After the storage node sets a common identification for the file to be read and its corresponding metadata, it stores the metadata under the designated path or in the file's storage location. When the computing node reads the metadata, it can then retrieve it from that designated path or storage location using the common identification, without modifying the file-reading logic, making the approach applicable to more computing nodes.
In another possible implementation, the metadata of the file to be read includes verification information used to check whether the metadata has changed after being stored on the storage node. Before calling multiple threads according to the starting position of each slice in the file and concurrently reading each slice's data, the computing node can use this verification information to verify the metadata, confirming that no data has been lost or corrupted since it was stored, before concurrently reading the file according to the metadata. Specifically, before the computing node calls multiple threads according to the starting position of each slice in the file to be read and concurrently reads each slice's data, the method further includes the following steps: the computing node verifies, according to the verification information, whether the metadata of the file has changed since being stored on the storage node; if it has not changed, the node calls multiple threads according to the starting position of each slice in the file and concurrently reads the data of each slice.
Optionally, the verification information may include a check mask, a metadata check value, a file check value, a metadata format version, and a file format version, among others. The check mask lets the computing node identify the metadata header, so it is usually located at the head of the metadata. The metadata check value lets the computing node determine whether the metadata has changed since being stored in the storage node; a change indicates the metadata may be corrupted or lost, in which case the computing node can fall back to other data processing methods commonly used in the industry to read the file. The file check value lets the computing node determine whether the file itself has changed since being stored; a change indicates the file may be corrupted or lost, in which case the computing node can return a data-processing-failure message. The metadata format version lets the computing node determine whether it supports reading metadata of that format version; if not, it can fall back to other common data processing methods to read the file. The file format version lets the computing node determine whether it supports reading files of that format version; if not, it can likewise fall back to other common data processing methods. It should be understood that the verification information may include more or less content, which is not specifically limited in this application. Moreover, the verification may use any checking method commonly used in the industry, such as a hash check or a SHA-256 check, which is not specifically limited in this application.
Before invoking multiple threads to concurrently read the file according to the metadata, the computing node can first read the verification information in the metadata header to determine whether the metadata has changed since it was stored in the storage node, and only use the metadata to read the file if it has not. This prevents the computing node from reading the file based on incorrect metadata after the metadata has changed, improving the feasibility of the solution provided by this application.
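The verification step above can be sketched as follows. This is a minimal illustration, not the patent's actual on-disk format: it assumes the metadata header is an 8-byte check mask followed by a SHA-256 metadata check value, and the names `pack_metadata`/`verify_metadata` and the mask value are invented for the example.

```python
import hashlib

CHECK_MASK = b"METAMASK"  # assumed 8-byte magic value marking the metadata header


def pack_metadata(body: bytes) -> bytes:
    """Prefix the metadata body with the check mask and a SHA-256 metadata
    check value so a reader can later detect corruption or loss."""
    return CHECK_MASK + hashlib.sha256(body).digest() + body


def verify_metadata(blob: bytes) -> bytes:
    """Return the metadata body only if the stored check value still matches;
    otherwise signal that the metadata changed after being stored."""
    if blob[:8] != CHECK_MASK:
        raise ValueError("check mask not found: not a metadata header")
    stored, body = blob[8:40], blob[40:]
    if hashlib.sha256(body).digest() != stored:
        raise ValueError("metadata changed after being stored in the storage node")
    return body
```

On a verification failure the computing node would fall back to a conventional single-threaded read, as the text describes.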
In another possible implementation, the metadata of the file to be read further includes a data type. When the data type is a dense matrix, the metadata further includes a value type (eigenvalue type), which the computing node uses to initialize the data structure of the memory space. Before the computing node invokes multiple threads according to each slice's start position in the file to concurrently read the data of each slice, the method may further include the following step: the computing node initializes the data structure of the memory space according to the data type. By initializing the in-memory data structure according to the value type in the metadata, the computing node ensures that data processing does not fail because of a wrong in-memory data structure, improving the reading efficiency of the file.
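A sketch of the dense-matrix case: the value type recorded in the metadata fixes the element layout, so the whole buffer can be allocated correctly in one step. The metadata field names and the type-to-typecode mapping are assumptions for illustration only.

```python
from array import array

# Assumed mapping from the value type string in the metadata to an array typecode.
TYPECODES = {"float32": "f", "float64": "d", "int32": "i", "int64": "q"}


def init_dense_buffer(metadata):
    """Allocate the dense-matrix memory space in a single step, with the
    element data structure fixed by the metadata's value type."""
    typecode = TYPECODES[metadata["value_type"]]
    itemsize = array(typecode).itemsize
    return array(typecode, bytes(metadata["rows"] * metadata["cols"] * itemsize))
```

If the buffer were allocated with the wrong element type, every thread's writes would land at the wrong byte offsets, which is exactly the failure mode the metadata is meant to prevent.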
In another possible implementation, when the data type is a sparse matrix, the sparse matrix is stored as three rows of characters, and every data element is saved through these three rows: one row holds each element's data column index, one row holds each element's data value, and one row holds the amount of data in each matrix row. Accordingly, the metadata of the file to be read further includes a value count, which is used to request a first memory space for storing the data values and the data column indices. Before the computing node invokes multiple threads according to each slice's start position in the file to concurrently read each slice, the above method further includes the following steps: the computing node requests the first memory space for storing the data values and data column indices according to the value count, requests a second memory space for storing the per-row data amounts according to the row count, and obtains the memory space from the first memory space and the second memory space.
When the data type of the file to be read is a sparse matrix, the computing node can request memory space according to the value count and row count in the metadata, ensuring that the memory for a sparse-matrix file can be requested in one step without repeatedly expanding the memory space, which avoids wasting resources and improves the reading efficiency of the file.
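The one-shot allocation for the sparse case can be sketched as below, assuming illustrative metadata field names (`value_count`, `row_count`) and a CSR-like layout matching the three-row storage form just described.

```python
from array import array


def alloc_sparse_spaces(metadata):
    """One-shot allocation for a sparse-matrix file: the value count sizes
    the first memory space (data values plus data column indices), and the
    row count sizes the second memory space (per-row data amounts)."""
    nnz, rows = metadata["value_count"], metadata["row_count"]
    values = array("d", bytes(8 * nnz))       # data values
    col_index = array("q", bytes(8 * nnz))    # data column indices
    row_counts = array("q", bytes(8 * rows))  # per-row data amounts
    return values, col_index, row_counts
```

Without the value count in the metadata, the reader would have to grow these buffers repeatedly while parsing, which is the resource waste the text describes.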
In another possible implementation, when the data type is a sparse matrix, each slice's start position in the file to be read includes the start position of the slice's data column indices, the start position of the slice's data values, and the start position of the slice's per-row data amounts. The computing node storing each slice's data into the memory space in the order of the slices' start positions in the file includes: the computing node stores each slice's data column indices and data values into the first memory space in the order of the slices' column-index start positions and data-value start positions, and stores each slice's per-row data amounts into the second memory space in the order of the slices' per-row-data-amount start positions.
When the data type of the file to be read is a sparse matrix, the computing node can read the three rows of the sparse matrix according to each slice's column-index start position, data-value start position, and per-row-data-amount start position, ensuring that sparse-matrix files can also be read concurrently and improving their reading efficiency.
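A sketch of the slice-placement step: because each slice record carries three independent start positions, worker threads can write their portion of values, column indices, and per-row amounts directly into the right region of each pre-allocated buffer with no coordination. The slice record layout (element offsets rather than byte offsets, and the field names) is an assumed simplification.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def fill_sparse_slices(slices, values, col_index, row_counts):
    """Write each slice's three data rows into the matching region of the
    pre-allocated first memory space (values + column indices) and second
    memory space (per-row data amounts), one thread per slice."""
    def write(s):
        v0, c0, r0 = s["val_off"], s["col_off"], s["row_off"]
        values[v0:v0 + len(s["vals"])] = array("d", s["vals"])
        col_index[c0:c0 + len(s["cols"])] = array("q", s["cols"])
        row_counts[r0:r0 + len(s["rows"])] = array("q", s["rows"])

    with ThreadPoolExecutor(max_workers=max(1, len(slices))) as pool:
        list(pool.map(write, slices))
```

Because the regions are disjoint, the concurrent writes need no locking, and the buffers end up in file order regardless of which thread finishes first.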
In a second aspect, another data processing method is provided, applied to a data processing system that includes a computing node and a storage node. The method includes the following steps: the storage node obtains a file to be read and derives the file's metadata from it, where the metadata includes the file's slice count, row count, and each slice's start position in the file. The row count is used by the computing node to request memory space for holding the file, the slice count is used by the computing node to create multiple threads, and each slice's start position is used by the computing node to invoke the multiple threads to concurrently read each slice's data and store it into the memory space in the order of the slices' start positions in the file. Finally, the storage node stores the file's metadata.
Because the storage node generates the metadata of the file in advance, the computing node can determine the file's length, slice count, each slice's start position, and other information from the metadata when reading the file, and can therefore request memory space in one step and read the file concurrently with multiple threads. This not only avoids data-processing failures caused by incorrectly initialized memory data structures when the data type cannot be determined, but also avoids the resource waste of repeatedly expanding memory space when the file's row count cannot be determined. Concurrent reading, in turn, greatly increases the speed at which the computing node reads files, further improving the processing efficiency of big data and AI tasks.
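The reader-side flow above can be sketched end to end: the slice count fixes the thread count, each slice's start position lets a thread read its region independently, and the target buffer is sized once up front. The metadata dict layout here is an assumed example, not the patent's format.

```python
from concurrent.futures import ThreadPoolExecutor


def concurrent_read(path, metadata):
    """Read a file concurrently: one thread per slice, each seeking to its
    slice's start position; results land in file order because all offsets
    are known from the metadata before any read begins."""
    starts = metadata["slice_starts"]
    size = metadata["file_size"]
    ends = starts[1:] + [size]
    buf = bytearray(size)  # the memory space, requested in a single step

    def read_slice(i):
        with open(path, "rb") as f:  # each thread uses its own file handle
            f.seek(starts[i])
            buf[starts[i]:ends[i]] = f.read(ends[i] - starts[i])

    with ThreadPoolExecutor(max_workers=metadata["slice_count"]) as pool:
        list(pool.map(read_slice, range(metadata["slice_count"])))
    return bytes(buf)
```

Note that each thread writes to a disjoint region of `buf`, so no locking is needed and the slices reassemble in their original order.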
In a possible implementation, the specific process by which the storage node obtains the metadata of the file to be read may be as follows: the storage node parses the file and determines its data type; it then determines the metadata format of the file according to the data type, where files of different data types have different metadata formats; finally, it generates the metadata according to the metadata format and the file itself.
The storage node parses the file in advance, determines the metadata format according to the file's data type, generates the metadata used to read the file, and then stores that metadata, so that when the computing node reads the file it can effectively initialize the in-memory data structure according to the metadata and read the file concurrently, improving reading efficiency. Moreover, the metadata is highly extensible: it can be further extended and enriched with whatever information various types of data require at read time, giving the solution provided by this application very broad applicability.
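A sketch of the storage-node side for a simple row-oriented file: one parse derives the row count and slice start positions aligned to row boundaries, and those become the metadata the reader needs. The field names and the row-aligned slicing policy are illustrative assumptions.

```python
def build_metadata(raw: bytes, slice_count: int) -> dict:
    """Parse the file once and emit reader metadata: row count, slice count,
    each slice's start position (aligned to a row start), and file size."""
    line_starts = [0] + [i + 1 for i, b in enumerate(raw) if b == ord("\n")]
    if line_starts and line_starts[-1] == len(raw):
        line_starts.pop()  # trailing newline: no row begins after it
    rows = len(line_starts)
    per = max(1, rows // slice_count)
    starts = [line_starts[i] for i in range(0, rows, per)][:slice_count]
    return {"row_count": rows, "slice_count": len(starts),
            "slice_starts": starts, "file_size": len(raw)}
```

Aligning slice boundaries to row starts means no record ever straddles two slices, so each reading thread can parse its region independently.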
In another possible implementation, the specific steps by which the storage node stores the metadata may be as follows: the storage node stores the metadata of the file to be read inside the file itself, and the end of the file includes the start position of the metadata within the file, so that after obtaining that start position from the end of the file, the computing node can read the metadata according to the metadata's start position in the file.
The metadata of the file to be read can be stored at the tail of the file, with the metadata header offset and a check mask written at the very end, where the check mask precedes the metadata header offset. When reading the metadata, the computing node can set the read pointer to the end of the file, read a bounded range of content backwards, and determine whether a check mask exists within that range. If a check mask exists, the node sets the pointer at the check mask, reads the metadata header offset forwards, then sets the read pointer to that header offset and reads forwards to obtain the metadata.
By storing the metadata inside the file to be read, the computing node can obtain the metadata's start position from the end of the file and then read the metadata, without the storage node having to set aside additional resources to store the metadata separately, which simplifies the storage node's file management and reduces its management burden.
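The tail layout and its reading procedure can be sketched on byte strings (standing in for file I/O). The mask value, the 64-byte backward scan window, and the 8-byte little-endian offset encoding are all assumptions for illustration; a real format would also guard against the mask appearing in payload data.

```python
MASK = b"\xaaMETAMRK"  # assumed 8-byte check mask marking the tail record


def append_metadata(payload: bytes, metadata: bytes) -> bytes:
    """Write the metadata after the payload; the very end of the file holds
    the check mask followed by the metadata header offset."""
    return payload + metadata + MASK + len(payload).to_bytes(8, "little")


def read_tail_metadata(blob: bytes) -> bytes:
    """Scan a bounded tail window backwards for the check mask, read the
    header offset after it, then jump to that offset and read forwards."""
    window = max(0, len(blob) - 64)
    pos = blob.rfind(MASK, window)
    if pos < 0:
        raise ValueError("no check mask found: file carries no embedded metadata")
    offset = int.from_bytes(blob[pos + len(MASK):pos + len(MASK) + 8], "little")
    return blob[offset:pos]
```

A reader that does not understand the tail record simply ignores it and processes the payload as an ordinary file, which is what keeps this layout backward compatible.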
In another possible implementation, the specific steps by which the storage node stores the metadata may be as follows: the storage node stores the metadata of the file to be read under a designated path on the storage node.
In another possible implementation, the specific steps by which the storage node stores the metadata may be as follows: the storage node stores the metadata of the file to be read under the storage location of the file itself.
In another possible implementation, the file to be read and its metadata include a common identifier, which the computing node uses to obtain the metadata from the designated path or from the file's storage location.
After the storage node assigns a common identifier to the file to be read and its corresponding metadata, it stores the metadata under a designated path or under the storage location of the file to be read. In this way, when reading the metadata, the computing node can retrieve it from that designated path or storage location according to the common identifier, without modifying the file-reading logic, so the method is applicable to more computing nodes.
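A small sketch of the common-identifier lookup. The convention chosen here — the identifier is the file's base name and the metadata file carries a `.meta` suffix — is an assumption; the patent only requires that file and metadata share some identifier.

```python
import os


def metadata_path_for(file_path, designated_dir=None):
    """Locate a file's metadata by the common identifier: look under the
    designated path if one is configured, otherwise under the file's own
    storage location, without touching the file-reading logic itself."""
    ident = os.path.basename(file_path)
    folder = designated_dir if designated_dir is not None else os.path.dirname(file_path)
    return os.path.join(folder, ident + ".meta")
```

Because the lookup is a pure naming convention, a computing node that lacks the metadata (or cannot find it) can still open the data file the ordinary way.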
It should be understood that this application provides the two metadata storage approaches above. In a specific implementation, the storage approach can be chosen flexibly according to the application environment, making the data processing method and data processing system provided by this application more widely applicable.
In another possible implementation, the metadata of the file to be read includes verification information, which the computing node uses to verify whether the metadata has changed since it was stored in the storage node.
Optionally, the verification information may include a check mask, a metadata check value, a file check value, a metadata format version, and a file format version, among others. The check mask lets the computing node identify the metadata header, so it is usually located at the head of the metadata. The metadata check value lets the computing node determine whether the metadata has changed since being stored in the storage node; a change indicates the metadata may be corrupted or lost, in which case the computing node can fall back to other data processing methods commonly used in the industry to read the file. The file check value lets the computing node determine whether the file itself has changed since being stored; a change indicates the file may be corrupted or lost, in which case the computing node can return a data-processing-failure message. The metadata format version lets the computing node determine whether it supports reading metadata of that format version; if not, it can fall back to other common data processing methods to read the file. The file format version lets the computing node determine whether it supports reading files of that format version; if not, it can likewise fall back to other common data processing methods. It should be understood that the verification information may include more or less content, which is not specifically limited in this application. Moreover, the verification may use any checking method commonly used in the industry, such as a hash check or a SHA-256 check, which is not specifically limited in this application.
The storage node writes the verification information into the metadata header of the file to be read, so that before invoking multiple threads to concurrently read the file according to the metadata, the computing node can first read the verification information in the metadata header to determine whether the metadata has changed since being stored in the storage node, and only use the metadata to read the file if it has not. This prevents the computing node from reading the file based on incorrect metadata after the metadata has changed, improving the feasibility of the solution provided by this application.
In another possible implementation, the metadata of the file to be read further includes a data type. When the data type is a dense matrix, the metadata further includes a value type (eigenvalue type), which the computing node uses to initialize the data structure of the memory space.
The storage node puts the value type into the dense matrix's metadata, so the computing node can initialize the in-memory data structure according to the value type in the metadata, ensuring that reading the file does not fail because of a wrong in-memory data structure and improving the reading efficiency of the file.
In another possible implementation, when the data type is a sparse matrix, the sparse matrix is stored as three rows of characters, and every data element is saved through these three rows: one row holds each element's data column index, one row holds each element's data value, and one row holds the amount of data in each matrix row. Accordingly, when the data type is a sparse matrix, the file to be read includes data values, data column indices, and per-row data amounts, and the metadata further includes a value count. The value count is used by the computing node to request a first memory space for storing the data values and data column indices, the row count is used by the computing node to request a second memory space for storing the per-row data amounts, and the memory space of the file to be read includes the first memory space and the second memory space.
When the data type of the file to be read is a sparse matrix, the storage node puts the value count into the sparse matrix's metadata, and the computing node can request memory space according to the value count and row count in the metadata, ensuring that the memory for a sparse-matrix file can be requested in one step without repeatedly expanding the memory space, which avoids wasting resources and improves the reading efficiency of the file.
In another possible implementation, when the data type is a sparse matrix, each slice's start position in the file to be read includes the start position of the slice's data column indices, the start position of the slice's data values, and the start position of the slice's per-row data amounts.
When the data type of the file to be read is a sparse matrix, the computing node can read the three rows of the sparse matrix according to each slice's column-index start position, data-value start position, and per-row-data-amount start position, ensuring that sparse-matrix files can also be read concurrently and improving their reading efficiency.
In a third aspect, a computing node is provided, including modules for executing the data processing method in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, a storage node is provided, including modules for executing the data processing method in the second aspect or any possible implementation of the second aspect.
In a fifth aspect, a data processing system is provided, including a computing node and a storage node. The computing node is configured to implement the operation steps of the method described in the first aspect or any possible implementation of the first aspect, and the storage node is configured to implement the operation steps of the method described in the second aspect or any possible implementation of the second aspect.
In a sixth aspect, a computer program product is provided which, when run on a computer, causes the computer to execute the methods described in the above aspects.
In a seventh aspect, a computer-readable storage medium is provided, storing instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Description of the drawings
The following briefly introduces the accompanying drawings used in describing the embodiments or the prior art:
FIG. 1 is a schematic architecture diagram of a multi-core processor provided by this application;
FIG. 2 is a schematic architecture diagram of a data processing system provided by this application;
FIG. 3 is a schematic structural diagram of a data processing system provided by this application;
FIG. 4 is a schematic flowchart of the steps of a data processing method provided by this application;
FIG. 5 and FIG. 6 are schematic diagrams of metadata formats provided by this application;
FIG. 7 shows the format of a file to be read that contains metadata, provided by this application;
FIG. 8 is a schematic flowchart of the steps of a data processing method provided by this application;
FIG. 9 is a schematic flowchart of another data processing method provided by this application;
FIG. 10 is a schematic flowchart of another data processing method provided by this application;
FIG. 11 is a schematic flowchart of another data processing method provided by this application;
FIG. 12 is a schematic structural diagram of a computing node provided by this application;
FIG. 13 is a schematic structural diagram of a server provided by this application;
FIG. 14 is a schematic structural diagram of a storage array provided by this application.
Detailed description
To facilitate understanding of the technical solutions of this application, some terms involved in this application are explained first. It should be noted that the terms used in the embodiments of this application are only intended to explain specific embodiments and are not intended to limit this application.
大数据:无法在一定时间范围内用常规软件工具进行捕捉、管理和处理的数据集合。大数据技术的战略意义在于对海量数据进行专业化处理,处理后的数据可以应用于各个行业,包括金融、汽车、餐饮、电信、能源等等,举例来说,利用大数据技术和物联网技术的无人驾驶汽车,利用大数据技术分析客户行为进行商品推荐、利用大数据技术实现信贷风险分析等等。Big data: A collection of data that cannot be captured, managed, and processed with conventional software tools within a certain time frame. The strategic significance of big data technology lies in the professional processing of massive amounts of data. The processed data can be applied to various industries, including finance, automobiles, catering, telecommunications, energy, etc., for example, using big data technology and Internet of Things technology Of unmanned cars, using big data technology to analyze customer behavior for product recommendation, using big data technology to realize credit risk analysis, and so on.
人工智能:利用数字计算机或者数字计算机控制的计算节点模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。人工智能的应用场景十分广泛,比如人脸识别、车辆识别、行人重识别、数据处理应用等等。AI的底层模型是一种实现AI的数学方法集合,可以使用大量的样本对AI模型进行训练来使训练完成的AI模型获得预测的能力,其中,用于训练AI模型的样本可以是从大数据平台获取的样本。Artificial Intelligence: Theories, methods, technologies and application systems that use digital computers or computing nodes controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. The application scenarios of artificial intelligence are very wide, such as face recognition, vehicle recognition, pedestrian re-recognition, data processing applications, and so on. The underlying model of AI is a collection of mathematical methods to achieve AI. A large number of samples can be used to train the AI model to make the trained AI model obtain the ability to predict. Among them, the samples used to train the AI model can be from big data. Samples obtained by the platform.
并发:两个或多个事件在同一段时间内同时发生,在操作系统的任务处理中,并发则是指在一段时间内有多个线程操作相同资源处理相同或不同的任务。需要注意的是,并发包括多个线程在一段时间内同时操作(并行),也包括多个线程在一段时间内分时交替操作。Concurrency: Two or more events occur at the same time in the same period of time. In the task processing of the operating system, concurrency refers to multiple threads operating the same resource to process the same or different tasks in a period of time. It should be noted that concurrency includes multiple threads operating at the same time (parallel) within a period of time, and also includes multiple threads operating alternately in time-sharing within a period of time.
内核(core):处理器的内核又称为处理器的核心,是处理器的重要组成部分。内核可以理解为处理器的可执行单元,处理器所有的计算、接收/存储命令、数据处理等任务都由核心执行。Core: The core of the processor is also called the core of the processor and is an important part of the processor. The kernel can be understood as the executable unit of the processor, and all tasks of the processor, such as calculation, receiving/storing commands, and data processing, are executed by the core.
线程(thread):线程是操作系统能够进行运算调度的最小单位。一个内核至少对应一个线程,通过超线程技术,一个内核还可以对应两个及以上的线程,即同时运行多个线程。Thread: Thread is the smallest unit that the operating system can perform operation scheduling. A core corresponds to at least one thread. Through hyper-threading technology, a core can also correspond to two or more threads, that is, multiple threads are running at the same time.
多核处理器:处理器中可以部署有一个或多个内核。若处理器中部署的内核个数M不小于2,则处理器称为多核处理器。图1是一种多核处理器芯片的结构示意图,其中,图1以M=8为例进行描述,如图1所示,多核处理器100的八个内核分为第一内核101、第二内核102、第三内核103、第四内核104、第五内核105、第六内核106、第七内核107以及第八内核108。其中,第一内核为主内核,负责任务调度(task scheduling),比如根据每个内核适合处 理的任务以及是否空闲等因素,将任务合理分配到其它内核进行处理。多核处理器中还包括用于存储数据的内存109,如双倍速率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)。其中,每个内核与内存以总线(bus)110的方式进行连接,且每个内核可以通过共享内存的方式访问内存中的数据。应理解,并发处理是多核处理器的优势所在,多核处理器可以在特定的时钟周期内调用多个线程并发处理更多的任务。Multi-core processor: One or more cores can be deployed in the processor. If the number M of cores deployed in the processor is not less than 2, the processor is called a multi-core processor. Figure 1 is a schematic diagram of the structure of a multi-core processor chip. Figure 1 takes M=8 as an example for description. As shown in Figure 1, the eight cores of the multi-core processor 100 are divided into a first core 101 and a second core. 102, the third core 103, the fourth core 104, the fifth core 105, the sixth core 106, the seventh core 107, and the eighth core 108. Among them, the first core is the main core and is responsible for task scheduling. For example, according to factors such as the tasks that each core is suitable for processing and whether it is idle, tasks are reasonably allocated to other cores for processing. The multi-core processor also includes a memory 109 for storing data, such as double data rate synchronous dynamic random access memory (DDR SDRAM). Among them, each core and the memory are connected in a bus 110, and each core can access the data in the memory by sharing the memory. It should be understood that concurrent processing is the advantage of the multi-core processor, and the multi-core processor can call multiple threads in a specific clock cycle to concurrently process more tasks.
Multi-CPU multi-core processor: also called a multi-chip multi-core processor, it contains multiple multi-core processor chips as shown in FIG. 1. The multiple multi-core processor chips are connected through an interconnect (interconnect), which can be implemented in a variety of ways, for example as a bus.
The application scenarios involved in this application are further described below with reference to the accompanying drawings.
FIG. 2 is a schematic architectural diagram of a big data or AI task processing system, and may also be referred to as a schematic architectural diagram of a data processing system, in which the computing node implements the file reading process and the storage node implements the file storage process. The system includes a computing node 210, a storage node 220, and a data collection node 230, where the processors on the computing node 210 and the storage node 220 are usually the multi-core processor 100 shown in FIG. 1 or a multi-CPU multi-core processor. The storage node 220, the data collection node 230, and the computing node 210 are connected through a network, which may be a wired network, a wireless network, or a mixture of the two.
The computing node 210 and the storage node 220 may be physical servers, such as X86 servers or ARM servers; they may also be virtual machines (virtual machine, VM) implemented on general-purpose physical servers using network functions virtualization (network functions virtualization, NFV) technology, where a virtual machine is a complete computer system that is simulated by software, has the functions of a complete hardware system, and runs in a completely isolated environment, such as a virtual machine in a cloud data center, which is not specifically limited in this application. The storage node 220 may also be another storage device with a storage function, such as a storage array. It should be understood that the computing node 210 and the storage node 220 may each be a single physical server or a single virtual machine, or may form a computer cluster, which is not specifically limited in this application.
The data collection node 230 may be a hardware device, for example, a physical server or a cluster of physical servers, or may be software, for example, a data collection system or a virtual machine deployed on a server. The data collection system can collect data stored on other servers, for example log information on a web server, and can also collect data gathered by other hardware devices. It should be understood that the above examples are for illustration only and are not specifically limited in this application.
It should be noted that FIG. 2 is a schematic diagram of a system architecture provided by an embodiment of this application, and the positional relationships between the nodes, modules, and the like shown in the figure do not constitute any limitation. For example, the computing node 210, the storage node 220, and the data collection node 230 in FIG. 2 are described as three independent devices or server clusters; in a specific implementation, the computing node 210, the storage node 220, and the data collection node 230 may also be the same server cluster or server, or the computing node 210 and the storage node 220 may be the same server cluster or server, and so on, which is not specifically limited in this application.
In the system shown in FIG. 2, the data collection node 230 collects various raw data and sends it to the storage node 220. The storage node 220 performs data processing on the received raw data, generates a file to be read, and stores it in the storage node 220. It should be understood that, because the sources of the raw data are very diverse and its data structures are very complex, the storage node 220 needs to "translate" the raw data into a unified format that can be directly read and written by the processor before storing it, where the data processing may include data cleaning, feature extraction, format conversion, and so on, which is not specifically limited in this application. The computing node 210 reads the various files to be read from the storage node 220 and loads them into the memory 109 of the computing node 210, and the multi-core processor 100 of the computing node 210 performs the operations of the big data or AI task based on the data in the memory 109. FIG. 2 is described taking as an example the second core 102 completing an AI task and the third core 103 completing a big data task; in a specific implementation, the multi-core processor 100 can process multiple tasks concurrently, and multiple cores can, within a given clock cycle, process the same AI task, the same big data task, or the same data processing task, which is not specifically limited in this application.
For example, suppose the data collection node 230 is a cloud server on which specific services (for example, Kafka and/or Flume) are deployed, where Kafka provides a high-throughput, highly scalable distributed message queue service, and Flume is a highly reliable, highly available, distributed system for collecting, aggregating, and transporting massive volumes of log data. The storage node 220 is a computer cluster on which a Hadoop distributed file system (hadoop distributed file system, HDFS) is deployed, and a data processing system such as Spark may also be deployed on the storage node 220, where Spark is a unified analytics engine for large-scale data processing. The computing node 210 is a computer cluster on which Spark-ML is deployed, where Spark-ML is used to process machine learning (machine learning, ML) tasks.
In the above example, the cloud server on which Kafka and/or Flume are deployed (the data collection node 230) may first produce massive raw data and save it in HDFS (the storage node 220). Spark on the storage node 220 may read the raw data and perform data processing, for example feature extraction and format conversion, converting the raw data into a data format that can be processed by machine learning or big data tasks, generating the file to be read and saving it in HDFS. Finally, Spark-ML (the computing node 210) reads the file to be read from HDFS and loads it into the memory 109, and the multi-core processor 100 performs machine learning tasks based on the data in the memory 109, such as the k-means clustering algorithm (k-means clustering algorithm, K-means) or linear regression (linear regression).
In summary, when performing big data, machine learning, and similar tasks, the computing node 210 needs to first read the file to be read from the storage node 220 and load it into the memory 109 of the computing node 210 (step 1 in FIG. 2), and then the computing node 210 performs the operations of the big data or machine learning task based on the data in the memory 109 (step 2 in FIG. 2).
Next, the data processing system provided by this application is further described with reference to FIG. 3.
This application provides a data processing system 400 as shown in FIG. 3. It should be understood that using the data processing system 400 shown in FIG. 3 to perform data processing in the application scenario shown in FIG. 2 can greatly increase the speed of data processing on the computing node 210, and thereby improve the efficiency with which the computing node 210 processes big data or AI tasks.
As shown in FIG. 3, the data processing system 400 includes a computing node 210 and a storage node 220. For the specific forms and connection modes of the computing node 210 and the storage node 220, reference may be made to the description of FIG. 2, and details are not repeated here.
The storage node 220 includes a metadata generation unit 221, which is configured to generate metadata of the file to be read. The metadata records basic information about the file to be read, including at least the number of rows of the file, the maximum slice count, and the starting position of each slice in the file. For example, the maximum slice count of the file to be read is 3 and its number of rows is 9; the starting position of slice 1 is row 1 of the file, the starting position of slice 2 is row 4, and the starting position of slice 3 is row 7. In a specific implementation, the metadata may also include more information, such as the value type and the number of columns, which may be determined according to the data type of the file to be read and is not specifically limited in this application.
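The basic information described above can be pictured as a small record. The sketch below uses hypothetical field names (the text fixes only the fields, not their representation) and reproduces the 3-slice, 9-row example:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BasicInfo:
    """Basic information recorded in the metadata of a file to be read.
    Field names are illustrative, not taken from the text."""
    row_count: int           # total number of rows in the file to be read
    max_slice_count: int     # maximum slice count
    slice_starts: List[int]  # starting row of each slice (1-based)

# The example from the text: 9 rows, 3 slices starting at rows 1, 4 and 7.
meta = BasicInfo(row_count=9, max_slice_count=3, slice_starts=[1, 4, 7])
```

Note that the record only marks where each slice would begin; the file itself remains unsliced.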
It should be noted that the metadata generation unit 221 only records the maximum slice count of the file to be read and the starting position of each slice in the file; it does not actually slice the file, and the file to be read is stored in the storage node 220 intact, in an unsliced state. Furthermore, the metadata may be stored in the storage node together with the file to be read as a separate file, or may be integrated with the file to be read into a single file in the storage node. The specific storage process of the metadata is described in step S520 of the embodiment of FIG. 4 below.
In a specific implementation, the metadata generation unit 221 may generate the corresponding metadata from the raw data when the storage node 220 receives it; it may generate the corresponding metadata from the processed data after the storage node 220 has performed data processing on the raw data (such as the aforementioned data cleaning, feature extraction, and format conversion) but before the file to be read is generated; or it may generate the corresponding metadata from the file to be read after the storage node 220 has generated it. This application does not limit the input data of the metadata generation unit 221.
The computing node 210 includes a metadata reading unit 211 and a slice reading unit 212. The metadata reading unit 211 is configured to read the metadata of the file to be read. The slice reading unit 212 is configured to determine, from the metadata, the number of rows of the file to be read, the slice count x, and the starting position of each slice in the file; to apply, according to the number of rows, for a region of memory space in which to store the file to be read; and then to send data read requests to y threads (y is an integer less than or equal to x; for example, if the slice count is 3, the number of threads may be 1, 2, or 3, and when y equals x, multiple threads can read the slices of the file in parallel). Each data read request carries the starting position of one slice in the file to be read and the address of the previously requested memory space; for example, the data read request received by thread 1 carries the starting position of slice 1 in the file, the data read request received by thread 2 carries the starting position of slice 2, and the data read request received by thread 3 carries the starting position of slice 3. Finally, in response to the data read requests, the y threads concurrently read the slices of the file to be read according to the starting positions they received, and write the slices they read into the aforementioned memory space in the order of each slice's starting position in the file.
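This read path — one memory allocation sized from the metadata, then one thread per slice seeking to that slice's start and writing into its own region of the shared buffer — can be sketched as follows. The byte-offset bookkeeping and function names are assumptions for illustration, not the claimed implementation:

```python
import threading

def read_slices_concurrently(path, slice_offsets, file_size):
    """Read all slices of a file in parallel into one pre-allocated buffer.

    slice_offsets: byte offset of each slice's start, in ascending order.
    file_size: total file size taken from the metadata, so that the
    buffer is allocated exactly once.
    """
    buffer = bytearray(file_size)               # single memory request
    bounds = list(slice_offsets) + [file_size]  # slice i spans bounds[i]..bounds[i+1]

    def read_one(i):
        start, end = bounds[i], bounds[i + 1]
        with open(path, "rb") as f:             # each thread gets its own handle
            f.seek(start)                       # jump to the slice's start position
            buffer[start:end] = f.read(end - start)

    threads = [threading.Thread(target=read_one, args=(i,))
               for i in range(len(slice_offsets))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(buffer)
```

Because each slice lands at its own offset inside the shared buffer, no reordering step is needed after the threads join: the buffer already holds the slices in file order.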
It should be noted that FIG. 3 is described taking one thread per core as an example (for example, in FIG. 3, core 1 corresponds to thread 1, core 2 to thread 2, and core 3 to thread 3). In a specific implementation, if the multi-core processor or multi-chip multi-core processor of the computing node 210 uses hyper-threading technology, one core may also correspond to multiple threads, for example core 1 to threads 1 and 2 and core 2 to thread 3, or core 1 to threads 1 through 3, and so on, so that multiple cores read the file concurrently, improving resource utilization and data processing efficiency.
Continuing the earlier example, suppose the data collection node 230 is a cloud server on which Kafka and/or Flume are deployed, the storage node 220 is a computer cluster on which HDFS and Spark are deployed, and the computing node 210 is a computer cluster on which Spark-ML is deployed. Then the above metadata generation unit 221 may be deployed in Spark, and the metadata reading unit 211 and the slice reading unit 212 may be deployed in Spark-ML.
In this example, the cloud server on which Kafka and/or Flume are deployed (the data collection node 230) may first produce massive raw data and save it in HDFS (the storage node 220). Spark on the storage node 220 may first read the raw data and perform data processing, for example feature extraction and format conversion; then generate the file to be read and the corresponding metadata from the processed data; and then save the file to be read and the corresponding metadata in HDFS. Finally, when Spark-ML (the computing node 210) reads the file to be read from HDFS, it first reads the metadata of the file, then applies for a contiguous region of memory space according to the information in the metadata, then invokes multiple threads to read the file concurrently and load it into the previously requested memory space, and then performs machine learning tasks based on the data in the memory 109. When reading the file to be read, the computing node 210 not only reads it concurrently, but also avoids the resource waste caused by repeatedly applying for memory and repeatedly copying data, so the efficiency of data processing is greatly improved.
It should be noted that, before reading the metadata, the metadata reading unit 211 determines whether corresponding metadata exists for the file to be read. If the file to be read has no metadata, it may notify the slice reading unit 212 in one thread to read the file to be read according to a data processing method currently available in the industry, which is not limited in this application.
In summary, in the data processing system provided by this application, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file. Thus, when reading the file, the computing node 210 can determine from the metadata information such as the length of the file, the slice count, and the starting position of each slice in the file, so that the memory space is requested once and multiple threads read the file concurrently. This avoids both the incorrect initialization of the in-memory data structure and the resulting data processing failures caused by an undeterminable data type, and the resource waste caused by repeatedly expanding the memory space because the number of rows of the file cannot be determined in advance, while also allowing the file to be read concurrently. The speed at which the computing node 210 reads files is thus greatly improved, further improving the processing efficiency of big data and AI tasks.
The data processing method provided by this application and applicable to the above data processing system 400 is explained below.
As can be seen from the foregoing, before the computing node 210 reads a file, the storage node 220 needs to generate the corresponding metadata from the file to be read and then store the file to be read and the corresponding metadata in the storage node 220. Therefore, the data processing method provided by this application is first described in detail below with reference to FIG. 4.
As shown in FIG. 4, the specific process by which the storage node 220 generates metadata may include the following steps:
S510: Obtain the file to be read from the data collection node 230, and parse the file to be read to obtain the metadata of the file to be read.
It is understandable that if the metadata contains too little information, the computing node 210 may still suffer from low data processing efficiency when reading the file, while if the metadata is too rich, the time the computing node 210 needs to read the metadata increases and the metadata reading efficiency decreases; the information contained in the metadata has a great influence on the efficiency of subsequent data processing. For this reason, this application provides multiple metadata formats to suit various application scenarios. In a specific implementation, after parsing the file to be read, the storage node may first determine the data type of the file, then determine the metadata format of the file according to its data type, where files of different data types have different metadata formats, and finally generate the metadata of the file to be read according to the metadata format and the parsing result.
The formats of the metadata provided by this application are briefly described below.
As can be seen from the foregoing, the metadata records basic information about the file to be read, including at least the number of rows of the file, the maximum slice count, and the starting position of each slice in the file. Exemplarily, the format of the metadata may therefore be as shown in FIG. 5, where the format includes at least basic information 610, and the basic information 610 includes:
(1) Number of rows, identifying the total number of rows contained in the file to be read, used by the computing node 210 to apply for the memory space in which to store the file to be read.
(2) Slice count, identifying the number of slices contained in the file to be read, used by the computing node 210 to apply for multiple threads to read the file to be read concurrently.
It should be noted that the slice count is usually the maximum slice count of the file to be read, and the maximum slice count is an empirical value. It is understandable that if the file to be read has too many slices, the metadata of the file becomes too long, which slows down the computing node 210's reading of the metadata; if the file has too few slices, some cores of the computing node 210 remain idle while the file is being read concurrently, wasting resources. Therefore, the maximum slice count of the file to be read may be determined according to the number of cores of the computing node 210; for example, the maximum slice count may equal the number of processor cores of the computing node 210, or bear a certain proportional relationship to the number of processor cores, which is not specifically limited in this application.
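As a minimal sketch of this sizing rule — the maximum slice count tracking the computing node's core count — the following uses a hypothetical proportionality factor; the text only says the two quantities may be equal or proportional:

```python
import os

def max_slice_count(cores=None, slices_per_core=1):
    """Pick a maximum slice count from the processor core count.

    slices_per_core is an illustrative tuning knob, not a value fixed by
    the text; cores defaults to the local machine's core count.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    return max(1, cores * slices_per_core)
```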
(3) Starting position of each slice, used by the threads to read the file to be read concurrently. Each thread can, according to the starting position of one slice in the file, read that slice of the file and place it into the previously requested memory space, thereby completing the concurrent reading of the file to be read and improving its reading efficiency.
In a specific implementation, the starting position of each slice may be the offset value and the line number of the slice's starting position in the file to be read. Each thread can determine the length l of its slice from that line number and the line number of the next slice's starting position, then set the read pointer to the offset value and read a slice of length l. Of course, the starting position of each slice may also include more or less content; for example, it may contain only the offset value of each slice's starting position in the file, or it may additionally include the length of each slice, which is not limited in this application.
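A minimal sketch of that per-thread step, assuming each starting position is stored as an (offset, line number) pair: the slice's row count l comes from the next slice's starting line number (or the file's total row count for the last slice), and the thread then seeks to the offset and reads l lines.

```python
def slice_row_count(line_nos, i, total_rows):
    """Rows l in slice i: distance from its starting line number to the
    next slice's starting line number (1-based; last slice runs to EOF)."""
    nxt = line_nos[i + 1] if i + 1 < len(line_nos) else total_rows + 1
    return nxt - line_nos[i]

def read_slice(path, offsets, line_nos, i, total_rows):
    """Set the read pointer to slice i's byte offset and read its l lines."""
    l = slice_row_count(line_nos, i, total_rows)
    with open(path, "r") as f:
        f.seek(offsets[i])
        return [f.readline() for _ in range(l)]
```

With the example metadata of a 9-row file whose slices start at lines 1, 4, and 7, each slice resolves to l = 3 rows.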
In an embodiment, because data in the storage node 220 may be lost or changed — for example, part of the metadata information may be missing, or the data content of the file to be read may have changed — which affects the efficiency with which the computing node 210 reads the file concurrently according to the metadata, the metadata may also include verification information, which is used to improve the reliability of the metadata.
Optionally, as shown in FIG. 5, in addition to the above basic information 610, the metadata may also include verification information 620, where the verification information 620 includes:
(4) Check mask, used by the computing node 210 to confirm that this is the header of the metadata; the check mask is therefore located at the header of the metadata. When the computing node 210 reads the metadata starting from its header, it may first verify the check mask at the header, which is not specifically limited in this application. If the computing node 210 verifies the check mask successfully, this proves that the current position of the read pointer is the header of the metadata; the computing node 210 can then start reading the metadata and, according to the metadata, invoke multiple threads to read the file to be read concurrently. If the verification of the check mask fails, the current position of the pointer is not the header of the metadata; the computing node 210 may then stop using the metadata to read the file, and instead invoke the slice reading unit 212 to read the file according to a data processing method currently available in the industry, which is not limited in this application. In a specific implementation, the check mask may be represented as a binary value to speed up processing.
(5) Metadata check value, used to check whether the content of the metadata information has changed.
(6) File check value, used to check whether the data content of the file to be read has changed.
(7) Metadata format version, recording the format version of the current metadata information; when the computing node reads the metadata, even if it does not support reading metadata information in the latest format, it can remain compatible with files in older versions.
(8) File format version, recording the format information of the current file to be read.
It should be noted that when reading the metadata, the computing node 210 may first read the verification information 620 and, after confirming that the metadata and the data content of the file to be read have not changed and that the version formats are compatible, then read the basic information 610 and invoke multiple threads to read the file concurrently. For this reason, in the metadata format shown in FIG. 5, the verification information 620 precedes the basic information 610. Of course, other means may also be used to ensure that the computing node reads the verification information 620 before the other metadata information, which is not specifically limited in this application.
It should be understood that verification information items (4) to (8) in FIG. 5 are for illustration; the metadata may also include more or fewer kinds of verification information to ensure its reliability, which is not specifically limited here. The methods used to verify items (4) to (6) above may be verification methods commonly used in the industry, such as a hash check or a sha256 check, which are not specifically limited in this application.
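A sketch of what such a check might look like, assuming (as the text permits but does not require) a sha256 file check value and a binary header mask; the constant and the field names are hypothetical:

```python
import hashlib

HEADER_MASK = 0b1010_0101_1010_0101  # illustrative binary check mask

def file_check_value(path):
    """sha256 digest of the file to be read, for comparison against the
    file check value recorded in the metadata."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def metadata_usable(meta, path):
    """Header mask first, then the file check value; on any mismatch the
    computing node falls back to the ordinary, non-concurrent read path."""
    if meta.get("check_mask") != HEADER_MASK:
        return False  # the read pointer is not at a metadata header
    return meta.get("file_check_value") == file_check_value(path)
```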
In an embodiment, the computing node needs different information when reading files of different data types. For example, in the AI field, the data type of a file to be read is usually a dense matrix or a sparse matrix. When the data type of the file is a dense matrix, the computing node 210 needs to initialize the in-memory data structure according to the string type of the values in each column of the dense matrix, to ensure that the file is neither parsed incorrectly nor lost; when the data type of the file is a sparse matrix, the computing node 210 does not need to obtain the value type of each column of the matrix, but instead applies, according to the number of values in the sparse matrix, for memory space in which to separately store the "data values" and the "data column indices". The metadata formats for different types therefore also differ. The dense matrix data type is taken as an example below to describe the metadata format.
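The two allocation strategies can be contrasted in a short sketch; the metadata field names (`value_type`, `value_count`, and so on) are assumptions for illustration:

```python
def init_memory(meta):
    """Pre-allocate the in-memory structure according to the data type
    recorded in the metadata."""
    if meta["data_type"] == "dense_matrix":
        # One typed cell per row and column; the recorded value type
        # decides how each cell is initialized.
        fill = "" if meta["value_type"] == "string" else 0.0
        return [[fill] * meta["col_count"] for _ in range(meta["row_count"])]
    if meta["data_type"] == "sparse_matrix":
        # Separate arrays for the data values and their column indices,
        # both sized by the number of stored values, not rows x columns.
        n = meta["value_count"]
        return {"values": [0.0] * n, "col_index": [0] * n}
    raise ValueError("unsupported data type: %r" % meta["data_type"])
```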
Optionally, as shown in FIG. 5, in addition to the above-mentioned basic information 610 and verification information 620, the metadata may further include type information 630. It should be understood that metadata of different data types has different type information 630. FIG. 5 uses the data type "dense matrix" as an example for description. When the data type is a dense matrix, the type information 630 includes:
(9) Data type, used to describe the name of the data type of the file to be read. FIG. 5 uses the data type "dense matrix" as an example for illustration.
(10) Feature value type, used to describe the type of the feature values of the dense matrix, for example, string. Different types of feature values require memory space with different data structures for storage; therefore, the computing node 210 can initialize the data structure of the memory space according to the type of the feature values of the dense matrix, to ensure that the file to be read is neither parsed incorrectly nor lost.
It is worth noting that because the computing node 210 executes different reading logic when reading files of different data types (for example, a dense matrix requires additional initialization of the data structure of the memory space), the type information 630 in FIG. 5 is located before the basic information 610. In this way, the computing node 210 first verifies the metadata and the file to be read according to the verification information 620, then determines its reading logic according to the type information 630, and finally invokes multiple threads to read the file concurrently according to the basic information 610 and the reading logic. Of course, other means may also be used to ensure the order in which the various metadata items are read, which is not specifically limited in this application.
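To make this ordering concrete, a minimal Python sketch is given below. The dictionary keys, the label strings, and the use of SHA-256 are illustrative assumptions, not part of the described metadata format: verification information is consulted first, and only then does the type information select the reading logic.

```python
import hashlib

def plan_read(meta, data):
    # Step 1: verification information 620 -- if the stored digest no
    # longer matches the file content, the metadata cannot be trusted.
    if meta["checksum"] != hashlib.sha256(data).hexdigest():
        return "fallback-generic-read"
    # Step 2: type information 630 -- choose the reading logic.
    if meta["data_type"] == "dense matrix":
        return "init-column-structures-then-concurrent-read"
    if meta["data_type"] == "sparse matrix":
        return "allocate-value-index-buffers-then-concurrent-read"
    return "fallback-generic-read"

data = b"1,2,3\n4,5,6\n"
meta = {"checksum": hashlib.sha256(data).hexdigest(),
        "data_type": "dense matrix"}
```

If the file content changes after the metadata was generated, the digest check fails and the caller falls back to a conventional read path, matching the behavior described for the verification information.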
It should be understood that files of different data types have different metadata formats, and the content of the type information 630 also differs. For example, as shown in FIG. 6, if item (9) data type of the metadata is "sparse matrix", the type information 630 will not include item (10), but will additionally include:
(11) Value count, used to store the number of values in the sparse matrix; the computing node 210 can request memory space according to the number of values in the sparse matrix. It should be understood that the sparse matrix is stored as three rows of characters in total, and every data item is saved by these three rows: one row of characters represents the "data column index" of each data item, one row represents the "data value" of each data item, and one row represents the "row data amount" of each data item. Therefore, for a sparse matrix, item (1) the number of rows is used to request the first memory space for storing the "row data amounts", and item (11) the value count is used to request the second memory space for storing the "data values" and the "data column indexes".
Moreover, in the basic information 610 of the metadata of a file to be read whose data type is a sparse matrix, item (3) the starting position of each slice is further divided into:
(3.1) The starting position of the data column indexes of each slice;
(3.2) The starting position of the data values of each slice;
(3.3) The starting position of the row data amounts of each slice.
In this way, each thread can read a slice's data column indexes, data values, and corresponding row data amounts according to the starting positions of the slice's three rows of data, and write the slice into the requested memory space in the three-row format of the sparse matrix. Specifically, according to the starting position of the data column indexes of each slice, the starting position of the data values of each slice, and the starting position of the row data amounts of each slice, the computing node 210 can invoke multiple threads to concurrently read the row data amounts of each slice into the first memory space and to concurrently read the data values and data column indexes of each slice into the second memory space, thereby obtaining the file to be read and achieving the purpose of reading multiple slices concurrently with multiple threads.
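As a hedged sketch of this layout (the Python lists, the per-slice offsets, and the field names are illustrative assumptions; real slices would be located by byte offsets inside a file), the three-row sparse representation can be read slice by slice into two preallocated spaces, with each worker writing only its own disjoint region, following the space assignment described for items (1) and (11) above:

```python
# A toy sparse "file" modeled as three parallel sequences.
col_index = [0, 2, 1, 0, 2]   # "data column index" row
values    = [5, 8, 3, 6, 7]   # "data value" row
row_amts  = [2, 1, 2]         # "row data amount" row (one per original row)

meta = {"rows": 3, "values": 5,
        # Hypothetical per-slice starting positions in each sequence,
        # standing in for metadata items (3.1)-(3.3).
        "slices": [{"val_start": 0, "amt_start": 0},
                   {"val_start": 3, "amt_start": 2}]}

first_space  = [0] * meta["rows"]               # row data amounts
second_space = {"values": [0] * meta["values"],
                "col_index": [0] * meta["values"]}

def read_slice(i):
    s = meta["slices"][i]
    nxt = meta["slices"][i + 1] if i + 1 < len(meta["slices"]) else None
    v_end = nxt["val_start"] if nxt else meta["values"]
    a_end = nxt["amt_start"] if nxt else meta["rows"]
    # Each worker touches only its own region, so slices can be read
    # concurrently without coordination.
    second_space["values"][s["val_start"]:v_end] = values[s["val_start"]:v_end]
    second_space["col_index"][s["val_start"]:v_end] = col_index[s["val_start"]:v_end]
    first_space[s["amt_start"]:a_end] = row_amts[s["amt_start"]:a_end]

for i in range(len(meta["slices"])):
    read_slice(i)
```

After all slices are processed, the two memory spaces together hold the complete sparse representation.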
In one embodiment, out of consideration for processor performance, in some application scenarios, when reading a file to be read whose data type is a sparse matrix, the computing node 210 may convert the data from a sparse matrix into a dense matrix before storing it in the memory space. During this conversion, the computing node 210 needs to know in advance the number of columns of the sparse matrix and the original row number of each data item; here, the original row number refers to the row in which the data item was located in the original data before the original data was converted into a sparse matrix and stored in the storage node 220. Therefore, when the data type is a sparse matrix, the type information 630 may further include (12) the number of columns, and item (3.3) the starting position of the row data amounts of each slice includes both the offset of the row data amounts of each slice and the original row numbers. In this way, each thread can read a slice's data column indexes, data values, and corresponding row data amounts according to the starting positions of the slice's three rows of data, and write the slice into the memory space according to the rows and columns of the original data, so that multiple threads can concurrently read multiple slices of the sparse matrix and convert the sparse matrix into a dense matrix when writing it into the memory space.
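A minimal sketch of this conversion follows. The function signature is an assumption: it takes the original row number of each value directly, whereas in the scheme above those row numbers would be recovered from the per-slice row data amounts.

```python
def sparse_to_dense(values, col_index, orig_rows, n_cols):
    # Build a dense matrix using (12) the number of columns from the type
    # information and the original row number attached to each value.
    n_rows = max(orig_rows) + 1
    dense = [[0] * n_cols for _ in range(n_rows)]
    for v, c, r in zip(values, col_index, orig_rows):
        dense[r][c] = v
    return dense
```

Positions not covered by any sparse value remain zero, which is exactly the information a sparse matrix omits.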
It should be understood that the metadata formats shown in FIG. 5 and FIG. 6 are used for illustration only. In a specific implementation, the solution provided in this application is applicable not only to the above-mentioned data types (sparse matrices and dense matrices) but also to other data types that can be read item by item or in batches, such as data in the LibSVM format; examples are not given one by one here. Moreover, the metadata of different data types may include more or less content; specifically, the content that the metadata needs to contain can be determined according to the information required by the computing node when reading the file to be read, which is not elaborated here.
S520: Store the metadata and the file to be read.
The storage node 220 stores the metadata in a specified path, or stores the metadata in the storage location of the file to be read, where the file to be read and its metadata contain a common identifier; for example, the file to be read and its metadata have the same file name but different extensions. For example, the storage path of the file to be read (dataA.exp) is /pathA/pathB/…/pathN/dataA.exp, where exp is the general data format of the file to be read, which may specifically be csv, libsvm, and so on. Assuming that the metadata extension is metadata, the storage path of the metadata of the file to be read (dataA.metadata) is /pathA/pathB/…/pathN/dataA.metadata. In this way, when the computing node 210 reads the file to be read, it can directly search the read path of the file to be read for the metadata corresponding to the file by the common identifier. Of course, the storage node 220 may also store the metadata of all files in a specified path; when the computing node 210 reads the file to be read, it can search the specified path for the metadata corresponding to the file according to the common identifier.
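The common-identifier lookup can be sketched as follows. The `.metadata` extension follows the example above, and the helper itself is an illustrative assumption covering both storage variants:

```python
from pathlib import PurePosixPath

def metadata_path(file_path, meta_dir=None):
    # Common identifier: same file name as the file to be read,
    # different extension.
    meta_name = PurePosixPath(file_path).stem + ".metadata"
    if meta_dir is not None:
        # Variant 2: all metadata files live under one specified path.
        return str(PurePosixPath(meta_dir) / meta_name)
    # Variant 1: metadata sits in the storage location of the file itself.
    return str(PurePosixPath(file_path).with_name(meta_name))
```

The computing node would check whether the returned path exists; if it does not, it falls back to a conventional read, as described later in step S810.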
Optionally, the storage node 220 may also store the metadata of the file to be read inside the file to be read itself, with the end of the file containing the starting position of the metadata within the file. In this way, when reading the metadata, the computing node 210 can first read a certain length of data backward from the end of the file to determine the position of the metadata header in the file, specifically the metadata header offset, and then set the read pointer to that metadata header offset for reading, thereby obtaining the metadata of the file to be read.
Exemplarily, after the metadata is appended to the end of the file to be read, the format of the file containing the metadata may be as shown in FIG. 7. Here, assuming that the original file has N rows of data in total, the metadata is appended to the end of the file to be read, and (13) a check mask and (14) a metadata header offset position are also appended to the end of the metadata, where:
(13) Check mask: the check mask is generally located before "(14) metadata header offset position" and is used by the computing node 210 to confirm the beginning of item (14). The computing node 210 can read a certain range of content backward from the end of the file to be read and determine whether the content in that range contains a check mask in the target format; if a check mask in the target format exists, it can then read (14) the metadata header offset position;
(14) Metadata header offset position, used by the computing node 210 to determine the position of the metadata header in the file to be read. In the example shown in FIG. 7, the offset position of the metadata header may be row N+1.
Simply put, when reading the file to be read, the computing node 210 can first set the read pointer to the end of the file, read a certain range of content at the tail of the file backward, and perform pattern matching on it to determine whether the content in that range contains a check mask in the target format. If no check mask in the target format exists, the computing node 210 reads the file to be read using a data processing method commonly used in the industry. If a check mask in the target format exists, the computing node sets the read pointer to the check mask, reads forward to obtain the metadata header offset position, sets the read pointer to that offset position, reads the metadata, and then invokes multiple threads according to the metadata to read the file to be read concurrently.
For example, the check mask may be "#HWBDFORMAT", and the metadata header offset position may be #12345678. When reading the file to be read, the computing node 210 can first set the read pointer to the end of the file, read a certain range of content at the tail of the file backward, and determine whether the content in that range contains the fixed format #HWBDFORMAT. If a check mask in that format exists, the computing node then reads the (14) metadata header offset position that follows the check mask, sets the pointer to the offset position "12345678", and starts reading the metadata.
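This tail-scan can be sketched as below. The `#HWBDFORMAT` mask and the `#<digits>` offset encoding follow the example in the text, but the exact byte layout and the size of the scanned window are assumptions:

```python
MASK = b"#HWBDFORMAT"

def find_metadata_header(blob, tail_window=64):
    # Read a bounded window at the end of the file backward and
    # pattern-match for the check mask; return the metadata header
    # offset, or None so the caller can fall back to a conventional
    # read path.
    tail = blob[-tail_window:]
    pos = tail.rfind(MASK)
    if pos == -1:
        return None
    after = tail[pos + len(MASK):]
    if not after.startswith(b"#"):
        return None
    return int(after[1:].decode())

blob = b"line1\nline2\n...metadata...#HWBDFORMAT#12345678"
```

A file without the mask simply yields `None`, which mirrors the described fallback to an industry-standard read.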
It should be noted that the format of the file to be read containing metadata shown in FIG. 7 is used for illustration only and is not specifically limited in this application.
This application provides the above two metadata storage methods; in a specific implementation, the metadata storage method can be selected according to the application environment. It can be understood that with the method of storing the metadata under the same file name in the storage path of the file to be read, the data processing logic of the computing node needs no modification and is highly reusable, but the file management burden on the storage node 220 increases. With the method of directly appending the metadata to the end of the file to be read, no extra files are generated, which facilitates file management on the storage node 220, but the data processing logic of the computing node must be modified so that the computing node can first read the metadata from the end of the file and then read the file to be read according to the metadata; if the computing node 210 cannot modify its data processing logic, the metadata must be stripped off before the file can be used by the computing node 210. Therefore, in a specific implementation, the metadata storage method can be flexibly determined according to the application environment, so that the data processing method provided in this application can be applied more widely.
It can be understood that in the data processing method provided in this application, the storage node 220 parses the file to be read in advance, determines the metadata format of the file according to its data type, generates the metadata used to read the file, and then stores the metadata, so that when reading the file, the computing node can effectively initialize the in-memory data structures according to the metadata and read the file concurrently, improving file reading efficiency. Moreover, the metadata is highly extensible: it can be further extended and enriched according to the various kinds of information required when reading various types of data, making the solution provided in this application very widely applicable.
The method by which the computing node 210 reads the file to be read is explained below. The data processing method provided in this application can be applied to the computing node 210 of the data processing system 400 described in FIG. 4. As shown in FIG. 8, the method includes the following steps:
S810: The computing node 210 obtains the metadata of the file to be read from the storage node 220, where the metadata of the file to be read includes the number of slices of the file, the number of rows, and the starting position of each slice within the file.
As can be seen from the foregoing, there are two ways to store the metadata; accordingly, there are also two ways for the computing node 210 to obtain the metadata of the file to be read. The two metadata obtaining methods are explained separately below.
In one embodiment, if the storage node 220 stores the metadata of the file to be read in a specified path of the storage node, or in the same storage location as the file to be read, then when the metadata is stored, the file to be read and its metadata include a common identifier; for example, the file to be read and its corresponding metadata have the same file name but different formats. In this case, step S810 may include the following steps: the computing node 210 obtains the common identifier of the file to be read, such as the file name, from the storage node 220, and then, according to the file name, obtains the metadata of the file from the specified path or from the storage location of the file to be read. If the metadata file exists, the computing node reads the metadata file, requests memory space and creates threads according to the metadata file, and invokes the threads to read the file to be read concurrently; if the metadata file does not exist, a data processing method commonly used in the industry is used for data processing, which is not specifically limited in this application.
Continuing with the foregoing example, suppose the storage node 220 has generated the file to be read dataA.exp and the corresponding metadata dataA.metadata, that is, the common identifier between the file and its metadata is the same file name, and then stores the file and the metadata together under /pathA/pathB/…/pathN. When reading dataA.exp, the computing node 210 can search the storage path /pathA/pathB/…/pathN for metadata with the same file name as the file to be read, namely dataA.metadata, or check whether the metadata file exists according to the storage path /pathA/pathB/…/pathN/dataA.metadata. If the metadata file exists, the computing node reads it and reads the file according to the metadata; if the metadata file does not exist, a data processing method commonly used in the industry is used for data processing, which is not specifically limited in this application.
In one embodiment, if the storage node 220 stores the metadata of the file to be read inside the file to be read, for example at the end of the file, step S810 may include the following steps: obtain, from the end of the file to be read, the starting position of the metadata within the file, which may specifically be the offset of the metadata header, and read the metadata according to that metadata header offset.
Still taking the content format shown in FIG. 7 as an example, when reading a file in the format shown in FIG. 7, the computing node 210 can first set the read pointer to the end of the file, read a certain range of content at the tail of the file backward, and perform pattern matching on it to determine whether the content in that range contains a (13) check mask in the target format. If no (13) check mask in the target format exists, the computing node 210 reads the file using a data processing method commonly used in the industry; if a (13) check mask in the target format exists, the computing node then reads the (14) metadata header offset position that follows the (13) check mask, sets the read pointer to that offset position, and reads the metadata.
It should be noted that, regardless of which method is used to obtain the metadata, if the metadata file does not exist, the computing node 210 can use a data processing method commonly used in the industry, parse the file to be read, and return the parsing result to the storage node 220 so that the storage node 220 can generate the metadata of the file according to the parsing result. In this way, when another computing node 210 reads the file, the storage node 220 can return the metadata to that computing node 210 so that it can read the file concurrently according to the metadata.
S820: The computing node invokes multiple threads according to the starting position of each slice within the file to be read, and reads the data of each slice concurrently, where the multiple threads are created by the computing node according to the number of slices.
Optionally, the number of threads y may be equal to the number of slices x. In this case each thread processes one slice, and the y threads can read the file to be read in parallel, achieving an optimal processing state, greatly increasing the speed at which the computing node reads the file, and further improving the processing efficiency of big data and AI tasks.
Optionally, the number of threads y may be less than the number of slices x. As can be seen from the foregoing, the number of slices x of the file to be read is determined according to the hardware processing capability of the computing node 210, and when the computing node 210 reads the file, some of its cores may currently be handling other matters, for example an in-progress big data task or AI task; in this case the number of threads y that the computing node 210 can create may be less than the number of slices x.
For example, suppose the metadata indicates that the maximum number of slices of the file to be read is 10 and the computing node 210 has 10 cores. If all cores of the computing node 210 are currently idle, the computing node 210 can directly create 10 threads and invoke them to read the slices of the file in parallel, achieving the optimal processing state, with the fastest file reading speed and the highest processing efficiency. If 3 cores of the computing node 210 are currently processing a big data task and only 7 cores are idle, the computing node 210 can create 7 threads G1 to G7 and invoke them to read the 10 slices of the file concurrently. It should be understood that the above example is for illustration only and is not specifically limited in this application.
S830: The computing node stores the data of each slice into the memory space in the order of the starting position of each slice within the file to be read, where the memory space is requested by the computing node according to the number of rows.
As can be seen from the foregoing, the starting position of each slice within the file to be read may be the offset and the row number of the slice's starting position in the file. Therefore, after each thread reads the data of a slice, multiple threads can be invoked to write the slices into the memory space concurrently, in the order of the offsets of the slices' starting positions or the order of their row numbers.
In a specific implementation, when the number of threads created is less than the number of slices, each thread can first process one slice and then, after finishing a slice, continue to take the next slice from the remaining slices until all slices have been read. Continuing with the above example, the computing node 210 creates 7 threads G1 to G7 to read the file to be read, while the file has 10 slices; threads G1 to G7 can first read slices 1 to 7 concurrently, and after thread 1 finishes slice 1, it continues to take a slice from the remaining slices to read. For example, if slice 8 is pending, thread 1 continues with slice 8 after finishing slice 1, and the other threads follow the same strategy until all slices are processed. It should be understood that the above example is for illustration only and is not specifically limited in this application.
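The "take the next remaining slice" strategy can be sketched with a shared work queue. The thread and slice counts match the example above; the append to a result list stands in for actually reading a slice and is, like the function itself, an illustrative assumption:

```python
import queue
import threading

def read_slices(slice_ids, num_threads):
    # Fewer threads than slices: every thread keeps pulling the next
    # pending slice off a shared queue until no slices remain.
    work = queue.Queue()
    for s in slice_ids:
        work.put(s)
    done, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                s = work.get_nowait()
            except queue.Empty:
                return
            with lock:
                done.append(s)  # stands in for reading slice s

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)
```

With 7 threads and 10 slices, every slice is read exactly once regardless of which thread happens to pick it up.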
In a specific implementation, when the number of threads created is less than the number of slices, it is also possible for some threads to process only one slice while others process multiple slices, thereby achieving the purpose of processing multiple slices in parallel. As can be seen from the foregoing, the starting position of a slice may include the offset and the row number of each slice's starting position in the file to be read, and each thread can determine the length of the slice to be read according to the row number of the starting position of that slice and the row number of the starting position of the next slice; in this way, some threads can read multiple slices starting from the starting position of the current slice, according to the lengths of the current slice and the following slices. Continuing with the above example, with 7 threads and 10 slices, 4 slices can be allocated for threads 1 to 4 to read concurrently and 6 slices for threads 5 to 7 to read concurrently, where thread 5 reads from the starting position of the 5th slice to the starting position of the 7th slice, thread 6 reads from the starting position of the 7th slice to the starting position of the 9th slice, and thread 7 reads from the starting position of the 9th slice to the end of the file. It should be understood that the above example is for illustration only and is not specifically limited in this application.
For example, as shown in FIG. 9, suppose the file to be read has 9 rows of data in total, denoted L1 to L9 respectively, and suppose the metadata of the file to be read is: (1) number of rows = 9; (2) number of slices = 3; (3) starting position of each slice = offset w1 and row number 1 for slice 1, offset w4 and row number 4 for slice 2, offset w7 and row number 7 for slice 3. Therefore, as shown in FIG. 9, after reading the metadata of the file to be read, the computing node 210 can create 3 threads G1 to G3 according to the slice count of 3, request from the memory 109 a memory space n0 capable of holding 9 rows of data according to the row count of 9, and then invoke the 3 threads to read the file to be read into the memory space n0 concurrently. Thread G1 reads slice 1, thread G2 reads slice 2, and thread G3 reads slice 3. Specifically, thread G1 determines that the length of slice 1 is 3 rows according to the row number 1 of slice 1 and the row number 4 of the next slice (slice 2); thread G2 determines that the length of slice 2 is 3 rows according to the row number 4 of slice 2 and the row number 7 of the next slice (slice 3); and thread G3 determines that the length of slice 3 is 3 rows according to the row number 7 of slice 3 and the total row count of 9. Then thread G1 sets the read pointer to the offset w1 and reads the 3 rows of data L1 to L3 into the first three rows of the memory space n0, thread G2 sets the read pointer to the offset w4 and reads the 3 rows of data L4 to L6 into rows 4 to 6 of the memory space n0, and thread G3 sets the read pointer to the offset w7 and reads the 3 rows of data L7 to L9 into the last three rows of the memory space n0, with threads G1, G2, and G3 processing these tasks concurrently, thereby completing one concurrent read of the file.
In summary, in the data processing method provided by this application, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file. When the computing node 210 reads the file from the storage node 220, it can determine the length of the file, the number of slices, the starting position of each slice in the file, and other information from the metadata, so that the memory space is applied for once and multiple threads read the file concurrently. This not only avoids incorrect initialization of the memory-space data structure and data processing failures caused by an undeterminable data type, but also avoids the resource waste caused by repeatedly expanding the memory space when the number of rows of the file cannot be determined in advance. Because the file can also be read concurrently, the speed at which the computing node 210 reads the file is greatly improved, further improving the processing efficiency of big data and AI tasks.
The above steps S810 to S830 describe the general data reading method provided by this application. As described above, the metadata formats of files to be read differ between data types, so the data reading procedure differs slightly between application scenarios. To make this application better understood, the following describes in detail, with reference to a specific application scenario, the process by which the computing node 210 reads the file to be read according to its metadata, taking as an example the case where the storage node 220 stores the file to be read and the corresponding metadata under the same path with the same file name, the data type of the file to be read is a dense matrix, and the metadata format is as shown in FIG. 5.
As shown in FIG. 10, in this application scenario, the procedure by which the computing node 210 obtains the metadata of the file to be read from the storage node 220 may be as follows:
S1001: Obtain the read path of the file to be read, for example /pathA/pathB/pathC/.../pathN/dataA.exp, where exp is a general data format such as csv or libsvm.
S1002: Check whether the metadata corresponding to the file to be read exists in the same path or a designated path according to the common identifier; if it exists, execute step S1003, and if it does not, execute step S1011. Assuming the metadata extension is metadata, /pathA/pathB/pathC/.../pathN/dataA.metadata can be looked up in the same path to determine whether the metadata dataA.metadata of the file to be read dataA.exp exists.
S1003: Open and load the metadata file.
S1004: Obtain the (4) check mask of the metadata file and verify it. If the check mask is verified successfully, this position is the header of the metadata file and reading of the metadata file can begin, that is, step S1005 is executed. If verification of the check mask fails, this position is not the header of the metadata file; the computing node 210 can stop reading the metadata and read the file to be read by another method, that is, execute step S1011.
S1005: Obtain the (5) metadata check value and verify it. If the metadata check value is verified successfully, the metadata has not changed since it was stored on the storage node 220; the computing node 210 can read the file to be read according to the content of the metadata and continue to step S1006. If verification of the metadata check value fails, the metadata may have changed due to data loss or other reasons; the computing node 210 can stop reading the metadata and execute step S1011.
In a specific implementation, the (5) metadata check value may be generated according to certain rules from information such as the data length of the metadata at the time of storage. In this way, when the computing node 210 reads the metadata, it can generate a verification check value from information such as the data length of the current metadata according to the same rules. If this check value equals the (5) metadata check value, the metadata has not changed, and step S1006 can be continued; if they are not equal, the metadata may have changed due to data loss or other reasons. It should be understood that the above implementation of the (5) metadata check value is only an example, and this application does not specifically limit the verification method of the metadata check value.
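One possible instance of such a rule is sketched below; combining the metadata length with a SHA-256 digest is an illustrative assumption, since the application leaves the exact generation rule open:

```python
import hashlib

def make_metadata_check_value(metadata_bytes):
    """Derive a check value from the metadata content at storage time.
    Here the rule is: byte length plus a SHA-256 digest of the bytes."""
    digest = hashlib.sha256(metadata_bytes).hexdigest()
    return f"{len(metadata_bytes)}:{digest}"

def metadata_unchanged(metadata_bytes, stored_check_value):
    # At read time (step S1005), recompute with the same rule and compare.
    return make_metadata_check_value(metadata_bytes) == stored_check_value
```

Any deterministic rule applied identically at write time and read time would serve the same purpose; a cryptographic digest simply makes silent corruption very unlikely to pass the comparison.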
S1006: Obtain the (6) file check value and verify it. If the file check value is verified successfully, the file to be read has not changed since it was stored; continue to step S1007. If verification of the file check value fails, the file to be read may have changed after storage due to data loss or other reasons; the computing node 210 can stop reading the file and return information that the read failed, that is, execute step S1012.
In a specific implementation, the computing node 210 may first determine whether the file check value is valid, since some storage nodes 220 may not generate a file check value, in which case the (6) file check value field is a meaningless character string. Therefore, if the file check value is invalid, step S1007 can be executed directly; if the check value is valid, it can be verified. If the file check value is verified successfully, continue to step S1007; if verification fails, the computing node 210 can stop reading the file to be read and return information that the read failed, that is, execute step S1012.
S1007: Obtain the (7) metadata format version, (8) file format version, and (9) data type. For example, if the metadata format version is V1, the file format is CSV, and the data type is dense matrix, determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V1, file format is CSV, and data type is dense matrix. If it does, the computing node 210 can execute step S1008; if it does not, execute step S1011.
S1008: Apply for memory space for loading the file to be read according to the (1) number of rows, and initialize the data structure of the memory space according to the (10) feature value type.
S1009: The computing node 210 obtains the (2) number of slices, x, and creates y threads according to the number of cores the processor currently has and the processing capability of the processor, where y is less than or equal to x. Alternatively, the number of threads used for each file read can be preset as y': if y' is not greater than x, y' threads can be requested for data processing, and if y' is greater than x, x threads can be requested.
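The thread-count choice in step S1009 reduces to taking a minimum; the sketch below assumes the core count is used as the default cap, which the step only suggests as one factor:

```python
def choose_thread_count(num_slices, num_cores, preset=None):
    """Pick the number of reader threads y for x slices (step S1009):
    never more threads than slices, and -- absent a preset y' -- no more
    than the cores the processor currently has."""
    if preset is not None:
        return min(preset, num_slices)    # y' if y' <= x, otherwise x
    return min(num_cores, num_slices)
```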
S1010: Each thread concurrently reads slices into the above memory space, taking slices in queue order.
If the number of threads equals the number of slices, thread 1 reads slice 1, thread 2 reads slice 2, and so on, so that multiple threads read multiple slices in parallel, greatly improving the reading efficiency of the file to be read and thus the processing efficiency of the entire big data or AI task.
If the number of threads is less than the number of slices, for example 8 threads and 16 slices, each thread first processes one slice; after a thread finishes its current slice, it takes another slice from the remaining slices and continues processing. For example, after thread 1 finishes slice 1 and slice 9 is still pending, thread 1 can continue with slice 9, and the other threads follow the same strategy until all slices are processed. This process can be implemented with a round-robin scheduling algorithm, which is not described in detail here.
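The take-a-slice-when-free behavior above can be sketched with a shared work queue; this is a simplified variant of the round-robin scheduling the application mentions, and the function and parameter names are illustrative:

```python
import queue
import threading

def process_slices_with_workers(num_threads, slices, handler):
    """Have `num_threads` worker threads drain a shared queue of slices
    (step S1010): each worker takes one slice, processes it with
    `handler`, then takes the next remaining slice, until none are left."""
    pending = queue.Queue()
    for s in slices:
        pending.put(s)

    def worker():
        while True:
            try:
                s = pending.get_nowait()   # take a remaining slice
            except queue.Empty:
                return                     # all slices handled
            handler(s)

    workers = [threading.Thread(target=worker) for _ in range(num_threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

A strict round-robin assignment would instead give thread i slices i, i + num_threads, i + 2·num_threads, and so on; the queue version additionally balances load when slices take unequal time.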
Alternatively, after the length of each slice is determined from its starting position, all slices can be allocated to all threads directly. Continuing the above example with 8 threads and 16 slices, the lengths l1 to l16 of the slices are determined first; thread 1 is then assigned slices 1 and 2 and reads data of length l1 + l2 from the starting position of slice 1, reading slice 1 and slice 2 into the memory space; thread 2 reads data of length l3 + l4 from the starting position of slice 3, reading slice 3 and slice 4 into the memory space; and so on. This application is not specifically limited in this regard.
S1011: The computing node 210 reads the file to be read by another method, such as other data processing methods commonly used in the industry, which are not specifically limited here.
S1012: The computing node 210 stops reading the file to be read and returns information that the data of the file to be read is erroneous and the read failed.
It can be understood that, in the above data processing method, by storing the metadata of the file to be read on the storage node 220 in advance, the computing node 210 can, when reading the file from the storage node 220, effectively initialize the memory space according to the metadata, avoiding read failures caused by data structure errors; it can apply once, according to the metadata, for a memory space large enough to hold the file to be read, avoiding the resource waste caused by repeatedly expanding the memory space; and it can read the file concurrently according to the metadata, improving the efficiency of data reading and thus the processing efficiency of entire AI and big data tasks. Moreover, further information can be appended to the metadata to meet functional requirements such as data security and reliability, so the scheme is highly extensible.
The following describes the above steps S810 to S830 with reference to another specific application scenario, in which the storage node 220 stores the metadata at the end of the file to be read in the manner shown in FIG. 7, the data type of the file to be read is a sparse matrix, and the metadata format is as shown in FIG. 6. The process by which the computing node 210 reads the file to be read according to this metadata is described in detail.
As shown in FIG. 11, in this application scenario, the procedure by which the computing node 210 obtains the metadata of the file to be read from the storage node 220 may be as follows:
S1101: Open the file to be read.
S1102: After determining the file size, set the current read pointer to the end of the file.
S1103: Read the content within a certain range at the tail of the file in reverse, and determine whether the content within that range contains the matching format (that is, the format of the (13) check mask). If it does, this position is the (13) check mask of the metadata, and step S1104 can be executed. If it does not, the file has no appended metadata, and the computing node 210 can process the data with a general data processing method, that is, execute step S1112.
S1104: Obtain the (14) metadata header offset following the (13) check mask, move the read pointer to that metadata header offset, and start reading the metadata.
S1105: Obtain the (4) check mask in the metadata and verify it a second time to further confirm whether this position is the header position of the metadata. If the check mask is verified successfully, execute step S1106; if verification fails, execute step S1112. For details, refer to the aforementioned step S1004, which is not repeated here.
S1106: Obtain the (5) metadata check value and verify it. If the metadata check value is verified successfully, continue to step S1107; if verification fails, execute step S1112. For details, refer to the aforementioned step S1005, which is not repeated here.
S1107: Obtain the (6) file check value and verify it. If the file check value is verified successfully, the file to be read has not changed since it was stored; continue to step S1108. If verification fails, the file to be read may have changed after storage due to data loss or other reasons; the computing node 210 can stop reading the file and execute step S1113. For details, refer to the aforementioned step S1006, which is not repeated here.
S1108: Obtain the (7) metadata format version, (8) file format version, and (9) data type. For example, if the metadata format version is V2, the file format is CSV, and the data type is sparse matrix, determine whether the current computing node 210 supports processing a file to be read whose metadata format version is V2, file format is CSV, and data type is sparse matrix. If it does, the computing node 210 can execute step S1109; if it does not, execute step S1112.
S1109: Apply for memory space for storing the data values and data column indexes according to the (10) number of values, and apply for memory space for storing the per-row data counts according to the (1) number of rows.
S1110: The computing node 210 obtains the (2) number of slices, x, and creates y threads according to the number of cores the processor currently has and the processing capability of the processor, where y is less than or equal to x. For details, refer to the aforementioned step S1009, which is not repeated here.
S1111: Each thread concurrently reads multiple slices of the file to be read into the memory space. For details, refer to step S1010 above, which is not repeated here.
It should be noted that, for a file to be read whose data type is a sparse matrix, when the computing node 210 calls multiple threads to concurrently read the file, it can, according to the data column index starting position, data value starting position, and per-row data count starting position of each slice, call multiple threads to concurrently read the data values and data column indexes of each slice into the first memory space, and call multiple threads to concurrently read the per-row data counts of each slice into the second memory space, thereby obtaining the file to be read.
Moreover, in consideration of processor performance, in some application scenarios the computing node 210 needs to convert the sparse matrix into a dense matrix before loading it into the memory space. Therefore, each thread can convert the sparse matrix into a dense matrix according to information such as the (1) number of rows, (12) number of columns, and (10) number of values in the metadata, and then write it into the memory space. For details, refer to the embodiment of FIG. 6, which is not repeated here.
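The per-slice sparse-to-dense conversion above can be sketched as follows, assuming a CSR-style layout consistent with the data values, data column indexes, and per-row data counts described for FIG. 6; the function and parameter names are illustrative assumptions:

```python
def slice_to_dense(values, col_indices, row_counts, num_cols):
    """Expand one sparse slice -- its data values, data column indexes,
    and per-row data counts -- directly into dense rows while reading,
    using the column count taken from the metadata."""
    dense = []
    k = 0                                  # cursor into values/col_indices
    for count in row_counts:               # one entry per row in the slice
        row = [0] * num_cols               # dense row sized from metadata
        for _ in range(count):             # scatter this row's nonzeros
            row[col_indices[k]] = values[k]
            k += 1
        dense.append(row)
    return dense

# rows [[0, 5, 0, 0], [7, 0, 0, 9]] stored sparsely:
print(slice_to_dense([5, 7, 9], [1, 0, 3], [1, 2], 4))
```

Because the metadata supplies the row, column, and value counts before any data is read, each thread can allocate and fill its dense rows in a single pass over its slice, which is the point made in the following paragraph.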
It can be understood that, without the method provided by this application, to read a sparse matrix the computing node 210 would need to read the entire file to be read, first parse out information such as its number of rows, number of columns, and number of values, and only then convert the sparse matrix into a dense matrix. With the method provided by this application, multiple threads can, according to the number of rows, number of columns, and number of values in the metadata, write slices into the memory space directly in dense-matrix form while reading them concurrently. This avoids the process of reading the entire sparse matrix before converting it into a dense matrix, and improves the reading efficiency of files whose data type is a sparse matrix.
S1112: The computing node 210 reads the file to be read by another method, such as other data processing methods commonly used in the industry, which are not specifically limited here.
S1113: The computing node 210 stops reading the file to be read and returns information that the data of the file to be read is erroneous and the read failed.
It can be understood that, in the above data processing method, by storing the metadata of the file to be read on the storage node 220 in advance, the computing node 210 can, when reading the file from the storage node 220, first effectively initialize the memory space according to the metadata, avoiding read failures caused by data structure errors; it can apply once, according to the metadata, for the memory space used to hold the file to be read, avoiding the resource waste caused by repeatedly expanding the memory space; it can read the file concurrently according to the metadata, improving the efficiency of data reading and thus the processing efficiency of entire AI and big data tasks; and it can, according to the metadata, directly convert a file to be read whose data type is a sparse matrix into a dense matrix and load it into memory, improving the reading efficiency of sparse matrices. Moreover, further information can be appended to the metadata to adapt to the reading of more types of data files, making this data processing method very widely applicable.
The methods of the embodiments of this application have been described in detail above. To facilitate better implementation of the above solutions of the embodiments of this application, related devices for cooperating in implementing the above solutions are correspondingly provided below.
FIG. 12 is a schematic structural diagram of a computing node 210 provided by this application. The computing node 210 is applied to the data processing system 400 shown in FIG. 3, and the computing node 210 includes:
a metadata reading unit 211, configured to obtain the metadata of the file to be read, where the metadata of the file to be read includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read;
a slice reading unit 212, configured to call multiple threads according to the starting position of each slice in the file to be read and concurrently read the data of each slice, where the multiple threads are created by the computing node according to the number of slices;
the slice reading unit 212 is further configured to store the data of each slice into the memory space in the order of the starting position of each slice in the file to be read, where the memory space is obtained by the computing node by applying according to the number of rows.
Optionally, the metadata of the file to be read is generated by the storage node according to the metadata format and the file to be read, after the storage node determines the metadata format of the file to be read according to the data type of the file to be read, where files of different data types have different metadata formats.
Optionally, the metadata of the file to be read is stored in the file to be read, and the end of the file to be read includes the starting position of the metadata in the file to be read. The metadata reading unit 211 is configured to obtain, from the end of the file to be read, the starting position of the metadata in the file to be read, and to read the metadata of the file to be read according to that starting position.
Optionally, the metadata of the file to be read is stored in a designated path on the storage node.
Optionally, the storage location of the metadata of the file to be read is the same as the storage location of the file to be read.
Optionally, the file to be read and its metadata include a common identifier. The metadata reading unit 211 is configured to obtain the common identifier of the file to be read from the storage node, and to obtain the metadata of the file to be read from the designated path or from the storage location of the file to be read according to the common identifier.
Optionally, the metadata of the file to be read includes verification information used to verify whether the metadata has changed since it was stored on the storage node. The slice reading unit 212 is configured to, before calling multiple threads according to the starting position of each slice in the file to be read to concurrently read the data of each slice, verify according to the verification information whether the metadata of the file to be read has changed since it was stored on the storage node, and to call the multiple threads to concurrently read the data of each slice according to the starting position of each slice in the file to be read only when the metadata has not changed since it was stored on the storage node.
Optionally, the metadata of the file to be read further includes the data type. When the data type is dense matrix, the metadata further includes a feature value type, which is used by the computing node to initialize the data structure of the memory space according to the feature value type. The slice reading unit 212 is configured to initialize the data structure of the memory space according to the data type before calling multiple threads according to the starting position of each slice in the file to be read to concurrently read the data of each slice.
Optionally, when the data type is sparse matrix, the file to be read includes data values, data column indexes, and per-row data counts, and the metadata further includes a number of values used to apply for the first memory space for storing the data values and data column indexes. The slice reading unit 212 is configured to, before calling multiple threads according to the starting position of each slice in the file to be read to concurrently read each slice, apply for the first memory space for storing the data values and data column indexes according to the number of values, apply for the second memory space for storing the per-row data counts according to the number of rows, and obtain the memory space for storing the file to be read from the first memory space and the second memory space.
Optionally, when the data type is sparse matrix, the starting position of each slice in the file to be read includes the data column index starting position, the data value starting position, and the per-row data count starting position of each slice. The slice reading unit 212 is configured to, before storing the data of each slice into the memory space in the order of the starting position of each slice in the file to be read, store the data column indexes and data values of each slice into the first memory space in the order of the data column index starting positions and the data value starting positions of the slices, and store the per-row data counts of each slice into the second memory space in the order of the per-row data count starting positions of the slices.
It should be understood that the computing node 210 of the embodiments of this application may be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD), where the PLD may be a complex programmable logical device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the data processing methods shown in FIG. 1 to FIG. 11 are implemented by software, the computing node 210 and its modules may also be software modules.
The computing node 210 according to the embodiments of this application may correspond to performing the methods described in the embodiments of this application, and the above and other operations and/or functions of the units in the computing node 210 are respectively intended to implement the corresponding procedures of the methods in FIG. 1 to FIG. 11. For brevity, details are not repeated here.
综上可知,本申请提供计算节点在进行数据读取时,由存储节点220在计算节点210读取待读取文件之前,提前生成了待读取文件的元数据,使得计算节点210从存储节点220读取待读取文件时,可以根据待读取文件的元数据确定待读取文件的长度、切片数量以及每个切片在待读取文件中的起始位置等信息,从而达到一次性申请内存空间,多个线程并发读取文件的目的,不仅避免了由于无法确定数据类型导致内存空间数据结构初始化有误、数据处理失败的问题,还避免了由于无法确定待读取文件的行数导致多次扩充内存空间造成的资源浪费,又可以并发读取文件,使得计算节点210读取文件的速度得到极大提升,进一步提升大数据和AI任务的处理效率。In summary, in the data reading provided by this application, the storage node 220 generates the metadata of the file to be read in advance, before the computing node 210 reads the file. When reading the file from the storage node 220, the computing node 210 can therefore determine, from the metadata, information such as the length of the file, the number of slices, and the starting position of each slice in the file, so that memory space is applied for once and multiple threads read the file concurrently. This not only avoids incorrect initialization of the memory-space data structure and data processing failures caused by an undeterminable data type, but also avoids the resource waste caused by repeatedly expanding the memory space when the number of rows of the file cannot be determined. Concurrent reading further greatly improves the speed at which the computing node 210 reads files, improving the processing efficiency of big data and AI tasks.
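As an illustrative sketch of this read path, assuming metadata fields named `length` and `slices` (a list of offset/size pairs) — these names are assumptions of the sketch, not the embodiment's actual metadata format:

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_with_metadata(path, metadata):
    """Allocate the memory space once from the metadata, then read all
    slices concurrently into it."""
    buf = bytearray(metadata["length"])  # one-shot allocation, never expanded

    def read_slice(slice_info):
        offset, size = slice_info
        with open(path, "rb") as f:  # each thread uses its own file handle
            f.seek(offset)
            buf[offset:offset + size] = f.read(size)

    # Thread count bounded by both the slice count and the processor capacity.
    workers = min(len(metadata["slices"]), os.cpu_count() or 1)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(read_slice, metadata["slices"]))
    return bytes(buf)
```

Each thread writes into a disjoint region of the pre-allocated buffer, so no locking is needed around the slice copies.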
图13为本申请实施例提供的一种服务器1300的结构示意图。其中,服务器1300可以是图1-图11实施例中的计算节点210以及存储节点220。如图13所示,服务器1300包括:处理器1310、通信接口1320以及存储器1330。其中,处理器1310、通信接口1320以及存储器1330可以通过内部总线1340相互连接,也可通过无线传输等其他手段实现通信。本申请实施例以通过总线1340连接为例,总线1340可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线1340可以分为地址总线、数据总线、控制总线等。为便于表示,图13中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。FIG. 13 is a schematic structural diagram of a server 1300 provided by an embodiment of this application. The server 1300 may be the computing node 210 and the storage node 220 in the embodiment of FIG. 1 to FIG. 11. As shown in FIG. 13, the server 1300 includes a processor 1310, a communication interface 1320, and a memory 1330. Among them, the processor 1310, the communication interface 1320, and the memory 1330 may be connected to each other through an internal bus 1340, and may also communicate through other means such as wireless transmission. The embodiment of the present application takes the connection via the bus 1340 as an example. The bus 1340 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus. The bus 1340 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 13, but it does not mean that there is only one bus or one type of bus.
处理器1310可以由至少一个通用处理器构成,例如CPU,或者CPU和硬件芯片的组合。上述硬件芯片可以是ASIC、PLD或其组合。上述PLD可以是CPLD、FPGA、GAL或其任意组合。处理器1310执行各种类型的数字存储指令,例如存储在存储器1330中的软件或者固件程序,它能使计算节点210提供多种服务。处理器1310可以是图1所示的多核处理器,也可以是多CPU多核处理器,本申请不作具体限定。The processor 1310 may be constituted by at least one general-purpose processor, such as a CPU, or a combination of a CPU and a hardware chip. The above-mentioned hardware chip may be ASIC, PLD or a combination thereof. The above-mentioned PLD can be CPLD, FPGA, GAL or any combination thereof. The processor 1310 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1330, which enables the computing node 210 to provide various services. The processor 1310 may be a multi-core processor shown in FIG. 1 or a multi-CPU multi-core processor, which is not specifically limited in this application.
在服务器1300是计算节点210的情况下,存储器1330用于存储程序代码,并由处理器1310来控制执行,以执行上述图1-图11中任一实施例中计算节点210的处理步骤。程序代码中可以包括一个或多个软件模块,这一个或多个软件模块可以为图1实施例中提供的计算节点210的软件单元,如元数据读取单元、切片读取单元等等,其中,元数据读取单元用于从存储节点获取待读取文件的元数据;切片读取单元用于根据切片数量和计算节点的处理器的处理能力创建多个线程,并根据行数申请用于存放待读取文件的内存空间;切片读取单元还用于根据每个切片在待读取文件中的起始位置,调用多个线程,并发读取每个切片至内存空间,获得待读取文件。具体可用于执行图8和图9实施例中的S810-步骤S830及其可选步骤、图10实施例中的步骤S1001~步骤S1012及其可选步骤、图11实施例中的步骤S1101~步骤S1113及其可选步骤,还可以用于执行图1-图11实施例描述的其他由计算节点210执行的步骤,这里不再进行赘述。In the case where the server 1300 is the computing node 210, the memory 1330 is used to store program code, whose execution is controlled by the processor 1310, so as to execute the processing steps of the computing node 210 in any of the embodiments in FIG. 1 to FIG. 11. The program code may include one or more software modules, and the one or more software modules may be the software units of the computing node 210 provided in the embodiment of FIG. 1, such as the metadata reading unit and the slice reading unit. The metadata reading unit is used to obtain the metadata of the file to be read from the storage node; the slice reading unit is used to create multiple threads according to the number of slices and the processing capacity of the computing node's processor, and to apply, according to the number of rows, for memory space for storing the file to be read; the slice reading unit is further used to call the multiple threads according to the starting position of each slice in the file to be read, to read the slices concurrently into the memory space, and to obtain the file to be read. Specifically, the program code may be used to execute steps S810 to S830 and their optional steps in the embodiments of FIG. 8 and FIG. 9, steps S1001 to S1012 and their optional steps in the embodiment of FIG. 10, and steps S1101 to S1113 and their optional steps in the embodiment of FIG. 11, and may also be used to execute the other steps performed by the computing node 210 described in the embodiments of FIG. 1 to FIG. 11; details are not repeated here.
在服务器1300是存储节点220的情况下,存储器1330用于存储程序代码,并由处理器1310来控制执行,以执行上述图1-图11中任一实施例中存储节点220的处理步骤。程序代码可以包括一个或多个软件模块,这一个或多个软件模块可以为图1实施例中提供的存储节点220的软件单元,如元数据生成单元,其中,元数据生成单元用于存储节点220根据待读取文件,获得待读取文件的元数据,待读取文件的元数据包括待读取文件的切片数量、行数、以及每个切片在待读取文件中的起始位置。具体可用于执行图5实施例中的S510-步骤S520及其可选步骤,还可以用于执行图1-图11实施例描述的其他由存储节点220执行的步骤,这里不再进行赘述。In the case where the server 1300 is the storage node 220, the memory 1330 is used to store program code, whose execution is controlled by the processor 1310, so as to execute the processing steps of the storage node 220 in any of the embodiments in FIG. 1 to FIG. 11. The program code may include one or more software modules, and the one or more software modules may be the software units of the storage node 220 provided in the embodiment of FIG. 1, such as the metadata generation unit. The metadata generation unit is used by the storage node 220 to obtain, from the file to be read, the metadata of the file to be read, where the metadata includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read. Specifically, the program code may be used to execute steps S510 to S520 and their optional steps in the embodiment of FIG. 5, and may also be used to execute the other steps performed by the storage node 220 described in the embodiments of FIG. 1 to FIG. 11; details are not repeated here.
存储器1330可以包括易失性存储器(volatile memory),例如随机存取存储器(random access memory,RAM);存储器1330也可以包括非易失性存储器(non-volatile memory),例如只读存储器(read-only memory,ROM)、快闪存储器(flash memory)、硬盘(hard disk drive,HDD)或固态硬盘(solid-state drive,SSD);存储器1330还可以包括上述种类的组合。存储器还存储有程序代码,在服务器1300是计算节点210的情况下,具体可以包括用于执行图1-图11实施例描述的由计算节点执行的步骤的程序代码,在服务器1300是存储节点220的情况下,具体可以包括用于执行图1-图11实施例描述的由存储节点执行的步骤的程序代码,并且,存储有待读取文件以及待读取文件的元数据。The memory 1330 may include volatile memory, such as random access memory (RAM); the memory 1330 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 1330 may also include a combination of the above types. The memory also stores program code: in the case where the server 1300 is the computing node 210, it may include the program code for executing the steps performed by the computing node described in the embodiments of FIG. 1 to FIG. 11; in the case where the server 1300 is the storage node 220, it may include the program code for executing the steps performed by the storage node described in the embodiments of FIG. 1 to FIG. 11, and additionally stores the file to be read and the metadata of the file to be read.
通信接口1320可以为内部接口(例如高速串行计算机扩展总线(peripheral component interconnect express,PCIe)总线接口)、有线接口(例如以太网接口)或无线接口(例如蜂窝网络接口或使用无线局域网接口),用于与其他设备或模块进行通信。The communication interface 1320 may be an internal interface (for example, a peripheral component interconnect express (PCIe) bus interface), a wired interface (for example, an Ethernet interface), or a wireless interface (for example, a cellular network interface or a wireless local area network interface), and is used to communicate with other devices or modules.
需要说明的是,本实施例可以是通用的物理服务器实现的,例如,ARM服务器或者X86服务器,也可以是基于通用的物理服务器结合NFV技术实现的虚拟机实现的,虚拟机指通过软件模拟的具有完整硬件系统功能的、运行在一个完全隔离环境中的完整计算机系统,比如在本实施例可以在云计算基础设施上实现。It should be noted that this embodiment may be implemented by a general-purpose physical server, for example, an ARM server or an X86 server, or by a virtual machine implemented on a general-purpose physical server combined with NFV technology. A virtual machine is a complete, software-simulated computer system with full hardware system functions that runs in a completely isolated environment; for example, this embodiment may be implemented on a cloud computing infrastructure.
需要说明的,图13仅仅是本申请实施例的一种可能的实现方式,实际应用中,服务器1300还可以包括更多或更少的部件,这里不作限制。关于本申请实施例中未示出或未描述的内容,可参见前述图1-图11实施例中的相关阐述,这里不再赘述。It should be noted that FIG. 13 is only a possible implementation of the embodiment of the present application. In actual applications, the server 1300 may also include more or fewer components, which is not limited here. Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
应理解,图13所示的服务器还可以是至少一个物理服务器构成的计算机集群,本申请不作具体限定。It should be understood that the server shown in FIG. 13 may also be a computer cluster composed of at least one physical server, which is not specifically limited in this application.
图14是本申请提供的一种存储阵列1400,该存储阵列1400可以是前述内容的存储节点220。其中,该存储阵列1400包括存储控制器1410和至少一个存储器1420,其中,存储控制器1410和至少一个存储器1420通过总线1430相互连接。FIG. 14 is a storage array 1400 provided by the present application. The storage array 1400 may be the storage node 220 of the foregoing content. The storage array 1400 includes a storage controller 1410 and at least one storage 1420, where the storage controller 1410 and the at least one storage 1420 are connected to each other through a bus 1430.
存储控制器1410包括一个或者多个通用处理器,其中,通用处理器可以是能够处理电子指令的任何类型的设备,包括CPU、微处理器、微控制器、主处理器、控制器以及ASIC等等。处理器1410执行各种类型的数字存储指令,例如存储在存储器1420中的软件或者固件程序,它能使存储阵列1400提供多种服务。The storage controller 1410 includes one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a CPU, a microprocessor, a microcontroller, a main processor, a controller, an ASIC, and so on. The processor 1410 executes various types of digital storage instructions, such as software or firmware programs stored in the memory 1420, which enable the storage array 1400 to provide multiple services.
存储器1420用于存储程序代码,并由存储控制器1410来控制执行,以执行上述图1-图11中任一实施例中存储节点220的处理步骤。程序代码可以包括一个或多个软件模块,这一个或多个软件模块可以为图1实施例中提供的存储节点220的软件单元,如元数据生成单元,其中,元数据生成单元用于存储节点220根据待读取文件,获得待读取文件的元数据,待读取文件的元数据包括待读取文件的切片数量、行数、以及每个切片在待读取文件中的起始位置。具体可用于执行图5实施例中的S510-步骤S520及其可选步骤,还可以用于执行图1-图11实施例描述的其他由存储节点执行的步骤,这里不再进行赘述。存储器1420还用于存储程序数据。其中,程序数据包括待读取文件和待读取文件的元数据,图14以程序代码存储于存储器1、程序数据存储于存储器n为例进行了说明,本申请不对此进行限定。The memory 1420 is used to store program code, whose execution is controlled by the storage controller 1410, so as to execute the processing steps of the storage node 220 in any of the embodiments in FIG. 1 to FIG. 11. The program code may include one or more software modules, and the one or more software modules may be the software units of the storage node 220 provided in the embodiment of FIG. 1, such as the metadata generation unit. The metadata generation unit is used by the storage node 220 to obtain, from the file to be read, the metadata of the file to be read, where the metadata includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read. Specifically, the program code may be used to execute steps S510 to S520 and their optional steps in the embodiment of FIG. 5, and may also be used to execute the other steps performed by the storage node described in the embodiments of FIG. 1 to FIG. 11; details are not repeated here. The memory 1420 is also used to store program data, where the program data includes the file to be read and the metadata of the file to be read. FIG. 14 is described using the example in which the program code is stored in memory 1 and the program data is stored in memory n; this application is not limited thereto.
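A minimal sketch of how such a metadata generation unit might derive the row count and slice start positions for a line-oriented file; the equal-rows slicing policy and the metadata field names are assumptions of this sketch:

```python
def generate_metadata(path, slice_count):
    """Scan the file once, record the byte offset of every line start,
    then choose slice boundaries that fall on line boundaries."""
    line_offsets = []
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            line_offsets.append(pos)
            pos += len(line)
    rows = len(line_offsets)
    per_slice = max(1, -(-rows // slice_count))  # ceiling division
    slice_starts = [line_offsets[i] for i in range(0, rows, per_slice)]
    return {
        "length": pos,                     # total file length in bytes
        "rows": rows,                      # number of rows in the file
        "slice_count": len(slice_starts),  # number of slices
        "slice_starts": slice_starts,      # starting position of each slice
    }
```

Generating this once on the storage side spares every reader a full pre-scan of the file.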
存储器1420可以是非易失性存储器,例如ROM、快闪存储器、HDD或SSD,存储器1420还可以包括上述种类的存储器的组合。例如,存储阵列1400可以是由多个HDD或者多个SSD组成,或者,存储阵列1400可以是由多个HDD以及ROM组成。其中,至少一个存储器1420在存储控制器1410的协助下按不同的方式组合起来形成存储器组,从而提供比单个存储器更高的存储性能和提供数据备份技术。The memory 1420 may be non-volatile memory, such as ROM, flash memory, HDD, or SSD; the memory 1420 may also include a combination of the above types of memory. For example, the storage array 1400 may be composed of multiple HDDs or multiple SSDs, or of multiple HDDs and ROMs. With the assistance of the storage controller 1410, the at least one memory 1420 can be combined in different ways to form a memory group, thereby providing higher storage performance than a single memory and providing data backup capabilities.
应理解,图14所示的存储阵列1400还可以是至少一个存储阵列构成的一个或者多个数据中心,上述一个或者多个数据中心可以设置在同一个地点,或者,分别在不同的地点,此处不作具体限定。It should be understood that the storage array 1400 shown in FIG. 14 may also be one or more data centers composed of at least one storage array, and the above-mentioned one or more data centers may be located at the same location, or at different locations. There are no specific restrictions.
需要说明的,图14仅仅是本申请实施例的一种可能的实现方式,实际应用中,存储阵列1400还可以包括更多或更少的部件,这里不作限制。关于本申请实施例中未示出或未描述的内容,可参见前述图1-图11实施例中的相关阐述,这里不再赘述。It should be noted that FIG. 14 is only a possible implementation of the embodiment of the present application. In practical applications, the storage array 1400 may also include more or fewer components, which is not limited here. Regarding the content that is not shown or described in the embodiments of the present application, please refer to the relevant descriptions in the foregoing embodiments of FIG. 1 to FIG. 11, which will not be repeated here.
本申请还提供一种包括图13所述服务器1300和图14所述存储阵列1400的系统,该系统用于实现上述图1至图11中所述方法中相应主体的操作步骤,为了避免重复,此处不再赘述。This application further provides a system including the server 1300 described in FIG. 13 and the storage array 1400 described in FIG. 14. The system is used to implement the operation steps of the corresponding entities in the methods described in FIG. 1 to FIG. 11; to avoid repetition, details are not repeated here.
本申请实施例还提供一种计算机可读存储介质,计算机可读存储介质中存储有指令,当其在处理器上运行时,图1-图11所示的方法流程得以实现。The embodiment of the present application also provides a computer-readable storage medium, which stores instructions in the computer-readable storage medium, and when it runs on a processor, the method flow shown in FIG. 1 to FIG. 11 is implemented.
本申请实施例还提供一种计算机程序产品,当计算机程序产品在处理器上运行时,图1-图11所示的方法流程得以实现。The embodiment of the present application also provides a computer program product. When the computer program product runs on a processor, the method flow shown in FIG. 1 to FIG. 11 can be realized.
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括至少一个计算机指令。在计算机上加载或执行计算机程序指令时,全部或部分地产生按照本发明实施例的流程或功能。计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含至少一个可用介质集合的服务器、数据中心等数据存储节点。可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,高密度数字视频光盘(Digital Video Disc,DVD))、或者半导体介质。半导体介质可以是SSD。The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes at least one computer instruction. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage node, such as a server or a data center, that includes at least one set of usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a high-density digital video disc (DVD)), or a semiconductor medium. The semiconductor medium may be an SSD.
以上,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以权利要求的保护范围为准。The foregoing descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present invention, and these modifications or replacements shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (18)

  1. 一种数据处理方法,其特征在于,应用于数据处理系统,所述数据处理系统包括计算节点和存储节点,所述方法包括:A data processing method, characterized in that it is applied to a data processing system, the data processing system includes a computing node and a storage node, and the method includes:
    所述计算节点获取待读取文件的元数据,其中,所述待读取文件的元数据包括所述待读取文件的行数以及每个切片在所述待读取文件中的起始位置;The computing node obtains metadata of the file to be read, where the metadata of the file to be read includes the number of rows of the file to be read and the starting position of each slice in the file to be read;
    所述计算节点根据所述每个切片在所述待读取文件中的起始位置,并发读取所述每个切片的数据;The computing node concurrently reads the data of each slice according to the starting position of each slice in the file to be read;
    所述计算节点按照所述每个切片在所述待读取文件中的起始位置的顺序,将所述每个切片的数据存储至内存空间,其中,所述内存空间是所述计算节点根据所述行数申请得到的。The computing node stores the data of each slice into memory space in the order of the starting positions of the slices in the file to be read, where the memory space is obtained by the computing node through an application made according to the number of rows.
  2. 根据权利要求1所述的方法,其特征在于,所述待读取文件的元数据是所述存储节点根据所述待读取文件的数据类型确定所述待读取文件的元数据格式后,根据所述元数据格式和所述待读取文件生成的,其中,不同的数据类型的待读取文件的元数据格式不同。The method according to claim 1, wherein the metadata of the file to be read is generated by the storage node according to the metadata format and the file to be read, after the storage node determines the metadata format of the file to be read according to the data type of the file to be read, wherein files to be read of different data types have different metadata formats.
  3. 根据权利要求1或2所述的方法,其特征在于,所述待读取文件的元数据存储于所述待读取文件中,所述待读取文件的末尾包括所述元数据在所述待读取文件中的起始位置,所述计算节点获取待读取文件的元数据包括:The method according to claim 1 or 2, wherein the metadata of the file to be read is stored in the file to be read, the end of the file to be read includes the starting position of the metadata in the file to be read, and the computing node obtaining the metadata of the file to be read comprises:
    所述计算节点从所述待读取文件的末尾获得所述元数据在所述待读取文件中的起始位置;The computing node obtains the starting position of the metadata in the file to be read from the end of the file to be read;
    所述计算节点根据所述元数据在所述待读取文件中的起始位置,读取所述待读取文件的元数据。The computing node reads the metadata of the file to be read according to the starting position of the metadata in the file to be read.
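For illustration only, the tail-stored metadata of claim 3 can be sketched as follows; the JSON encoding and the fixed 8-byte footer holding the starting position are assumptions of this sketch, not part of the claim:

```python
import json
import os
import struct

FOOTER = struct.Struct("<Q")  # assumed: 8-byte offset written at the very end

def append_metadata(path, metadata):
    # Append the metadata after the file body, then its starting position.
    with open(path, "ab") as f:
        start = f.tell()
        f.write(json.dumps(metadata).encode("utf-8"))
        f.write(FOOTER.pack(start))

def read_metadata(path):
    with open(path, "rb") as f:
        # Step 1: obtain the metadata's starting position from the file's end.
        f.seek(-FOOTER.size, os.SEEK_END)
        (start,) = FOOTER.unpack(f.read(FOOTER.size))
        # Step 2: read the metadata from that starting position.
        f.seek(start)
        blob = f.read()[:-FOOTER.size]
    return json.loads(blob)
```

Keeping the metadata inside the data file means a reader needs only one extra seek to the tail before starting its concurrent slice reads.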
  4. 根据权利要求1或2所述的方法,其特征在于,所述待读取文件的元数据存储于所述存储节点的指定路径。The method according to claim 1 or 2, wherein the metadata of the file to be read is stored in a designated path of the storage node.
  5. 根据权利要求1或2所述的方法,其特征在于,所述待读取文件的元数据存储位置与所述待读取文件的存储位置相同。The method according to claim 1 or 2, wherein the metadata storage location of the file to be read is the same as the storage location of the file to be read.
  6. 根据权利要求4或5所述的方法,其特征在于,所述待读取文件和所述待读取文件的元数据包括共同标识,所述计算节点获取待读取文件的元数据包括:The method according to claim 4 or 5, wherein the metadata of the file to be read and the file to be read includes a common identifier, and the computing node acquiring the metadata of the file to be read comprises:
    所述计算节点获取所述待读取文件的共同标识;Acquiring, by the computing node, the common identifier of the file to be read;
    所述计算节点根据所述待读取文件的共同标识,从所述指定路径或者所述待读取文件的存储位置获取所述待读取文件的元数据。The computing node obtains the metadata of the file to be read from the designated path or the storage location of the file to be read according to the common identifier of the file to be read.
  7. 根据权利要求1至6任一权利要求所述的方法,其特征在于,所述待读取文件的元数据包括校验信息,所述校验信息用于校验所述待读取文件的元数据存储至所述存储节点之后是否发生过变化,所述计算节点根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片的数据之前,所述方法还包括:The method according to any one of claims 1 to 6, wherein the metadata of the file to be read includes verification information, and the verification information is used to verify whether the metadata of the file to be read has changed after being stored in the storage node; before the computing node calls multiple threads according to the starting position of each slice in the file to be read and concurrently reads the data of each slice, the method further comprises:
    所述计算节点根据所述校验信息校验所述待读取文件的元数据在存储至所述存储节点后是否发生过变化;The computing node verifies, according to the verification information, whether the metadata of the file to be read has changed after being stored in the storage node;
    所述计算节点在所述待读取文件的元数据存储至所述存储节点之后未发生过变化的情况下,根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片的数据。In the case where the metadata of the file to be read has not changed after being stored in the storage node, the computing node calls multiple threads according to the starting position of each slice in the file to be read, and concurrently reads the data of each slice.
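For illustration only, the verification information of claim 7 can be sketched as a digest over the metadata body; the hash algorithm and the JSON serialization are assumptions of this sketch, as the claim fixes neither:

```python
import hashlib
import json

def attach_verification(metadata):
    """Store the metadata together with a digest computed over its body."""
    body = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return {"metadata": metadata, "checksum": hashlib.sha256(body).hexdigest()}

def metadata_unchanged(stored):
    """Recompute the digest and compare; False means the stored metadata
    changed after it was written, so the concurrent read must not proceed."""
    body = json.dumps(stored["metadata"], sort_keys=True).encode("utf-8")
    return hashlib.sha256(body).hexdigest() == stored["checksum"]
```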
  8. 根据权利要求1至7任一权利要求所述的方法,其特征在于,所述待读取文件的元数据还包括数据类型,在所述数据类型是稠密矩阵的情况下,所述元数据还包括特征值类型,所述特征值类型用于供所述计算节点初始化所述内存空间的数据结构;The method according to any one of claims 1 to 7, wherein the metadata of the file to be read further includes a data type; in the case where the data type is a dense matrix, the metadata further includes a feature value type, and the feature value type is used by the computing node to initialize the data structure of the memory space;
    所述计算节点根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片的数据之前,所述方法还包括:Before the computing node invokes multiple threads according to the starting position of each slice in the file to be read, and concurrently reads the data of each slice, the method further includes:
    所述计算节点根据所述数据类型初始化所述内存空间的数据结构。The computing node initializes the data structure of the memory space according to the data type.
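For illustration only, initializing the memory-space data structure from the data type and feature value type of claim 8 can be sketched as follows; the type-code mapping and type names are assumptions of this sketch:

```python
import array

# Hypothetical mapping from feature value types to array type codes.
TYPE_CODES = {"int32": "i", "int64": "q", "float32": "f", "float64": "d"}

def init_dense_memory(rows, cols, value_type):
    """Choose the element layout from the feature value type before any
    slice data arrives, so the structure never needs re-initialization."""
    if value_type not in TYPE_CODES:
        raise ValueError("unsupported feature value type: %s" % value_type)
    code = TYPE_CODES[value_type]
    itemsize = array.array(code).itemsize
    return array.array(code, bytes(rows * cols * itemsize))  # zero-filled
```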
  9. 根据权利要求1至8任一权利要求所述的方法,其特征在于,所述数据类型是稀疏矩阵的情况下,所述待读取文件包括数据值、数据列索引以及行数据量,所述待读取文件的元数据还包括值数量,所述值数量用于申请用于存放所述数据值以及所述数据列索引的第一内存空间,The method according to any one of claims 1 to 8, wherein when the data type is a sparse matrix, the file to be read includes a data value, a data column index, and a row data amount, and the The metadata of the file to be read also includes a value quantity, and the value quantity is used to apply for the first memory space for storing the data value and the data column index,
    所述计算节点根据所述每个切片在所述待读取文件中的起始位置,调用多个线程,并发读取所述每个切片之前,所述方法还包括:Before the computing node calls multiple threads according to the starting position of each slice in the file to be read, and before concurrently reading each slice, the method further includes:
    所述计算节点根据所述值数量申请用于存放所述数据值以及所述数据列索引的第一内存空间;The computing node applies for the first memory space for storing the data value and the data column index according to the number of values;
    所述计算节点根据所述行数申请用于存放所述行数据量的第二内存空间,根据所述第一内存空间和第二内存空间获得所述内存空间。The computing node applies for a second memory space for storing the amount of row data according to the number of rows, and obtains the memory space according to the first memory space and the second memory space.
  10. 根据权利要求9所述的方法,其特征在于,所述数据类型是稀疏矩阵的情况下,所述每个切片在所述待读取文件中的起始位置包括所述每个切片的数据列索引起始位置、所述每个切片的数据值起始位置以及所述每个切片的行数据量起始位置;The method according to claim 9, wherein in the case where the data type is a sparse matrix, the starting position of each slice in the file to be read includes the starting position of each slice's data column index, the starting position of each slice's data values, and the starting position of each slice's row data amount;
    所述计算节点按照所述每个切片在所述待读取文件中的起始位置的顺序,将所述每个切片的数据存储至内存空间包括:The computing node storing the data of each slice in the memory space in the order of the starting position of each slice in the file to be read includes:
    所述计算节点根据所述每个切片的数据列索引起始位置的顺序以及所述每个切片的数据值的起始位置的顺序,将所述每个切片的数据列索引以及数据值存储至所述第一内存空间,根据所述每个切片的行数据量的起始位置的顺序,将所述每个切片的行数据量存储至所述第二内存空间。The computing node stores each slice's data column index and data values into the first memory space in the order of the starting positions of the slices' data column indexes and the order of the starting positions of the slices' data values, and stores each slice's row data amount into the second memory space in the order of the starting positions of the slices' row data amounts.
  11. 根据权利要求1至10中任一权利要求所述方法,其特征在于,所述元数据还包括所述待读取文件的切片数量,所述计算节点根据所述每个切片在所述待读取文件中的起始位置,并发读取所述每个切片的数据,包括:The method according to any one of claims 1 to 10, wherein the metadata further includes the number of slices of the file to be read, and the computing node concurrently reading the data of each slice according to the starting position of each slice in the file to be read comprises:
    所述计算节点调用多个线程并发读取所述每个切片的数据,所述多个线程的数量小于或等于所述切片数量。The computing node calls multiple threads to concurrently read the data of each slice, and the number of the multiple threads is less than or equal to the number of slices.
  12. 根据权利要求1至10中任一权利要求所述方法,其特征在于,所述计算节点根据所述每个切片在所述待读取文件中的起始位置,并发读取所述每个切片的数据,包括:The method according to any one of claims 1 to 10, wherein the computing node concurrently reading the data of each slice according to the starting position of each slice in the file to be read comprises:
    所述计算节点调用多个线程并发读取所述每个切片的数据,所述多个线程的数量与所述切片的数量相同。The computing node calls multiple threads to concurrently read the data of each slice, and the number of the multiple threads is the same as the number of the slices.
  13. 一种数据处理方法,其特征在于,应用于数据处理系统,所述数据处理系统包括计算节点和存储节点,所述方法包括:A data processing method, characterized in that it is applied to a data processing system, the data processing system includes a computing node and a storage node, and the method includes:
    所述存储节点获取待读取文件;The storage node obtains the file to be read;
    所述存储节点根据所述待读取文件,获得所述待读取文件的元数据,所述待读取文件的元数据包括所述待读取文件的切片数量、行数、以及每个切片在所述待读取文件中的起始位置,其中,所述行数用于供所述计算节点申请用于存放所述待读取文件的内存空间,所述切片数量用于供所述计算节点创建多个线程,所述每个切片在所述待读取文件中的起始位置用于供所述计算节点调用所述多个线程,并发读取所述每个切片的数据,并按照所述每个切片在所述待读取文件中的起始位置的顺序,将所述每个切片的数据存储至所述内存空间;The storage node obtains, according to the file to be read, metadata of the file to be read, where the metadata of the file to be read includes the number of slices of the file to be read, the number of rows, and the starting position of each slice in the file to be read, where the number of rows is used by the computing node to apply for memory space for storing the file to be read, the number of slices is used by the computing node to create multiple threads, and the starting position of each slice in the file to be read is used by the computing node to call the multiple threads, concurrently read the data of each slice, and store the data of each slice into the memory space in the order of the starting positions of the slices in the file to be read;
    所述存储节点存储所述待读取文件的元数据。The storage node stores metadata of the file to be read.
  14. 根据权利要求13所述的方法,其特征在于,所述存储节点对所述待读取文件进行解析,获得所述待读取文件的元数据包括:The method according to claim 13, wherein the storage node parses the file to be read, and obtains metadata of the file to be read comprises:
    所述存储节点对所述待读取文件进行解析,确定所述待读取文件的数据类型;The storage node parses the file to be read, and determines the data type of the file to be read;
    所述存储节点根据所述待读取文件的数据类型,确定所述待读取文件的元数据格式,其中,不同的数据类型的待读取文件的元数据格式不同;The storage node determines the metadata format of the file to be read according to the data type of the file to be read, wherein the metadata format of the file to be read is different for different data types;
    所述存储节点根据所述待读取文件的元数据格式和所述待读取文件,生成所述待读取文件的元数据。The storage node generates metadata of the file to be read according to the metadata format of the file to be read and the file to be read.
  15. 根据权利要求13或14所述的方法,其特征在于,所述存储节点存储所述待读取文件的元数据包括:The method according to claim 13 or 14, wherein the storage node storing the metadata of the file to be read comprises:
    所述存储节点将所述待读取文件的元数据存储于所述待读取文件中,所述待读取文件的末尾包括所述元数据在所述待读取文件中的起始位置,使得所述计算节点从所述待读取文件的末尾获得所述元数据在所述待读取文件中的起始位置后,根据所述元数据在所述待读取文件中的起始位置,读取所述待读取文件的元数据。The storage node stores the metadata of the file to be read in the file to be read, and the end of the file to be read includes the starting position of the metadata in the file to be read, so that after obtaining, from the end of the file to be read, the starting position of the metadata in the file to be read, the computing node reads the metadata of the file to be read according to that starting position.
  16. 根据权利要求13或14所述的方法,其特征在于,所述存储节点存储所述待读取文件的元数据包括:The method according to claim 13 or 14, wherein the storage node storing the metadata of the file to be read comprises:
    所述存储节点将所述待读取文件的元数据存储于所述存储节点的指定路径。The storage node stores the metadata of the file to be read in a designated path of the storage node.
  17. 根据权利要求13或14所述的方法,其特征在于,所述存储节点存储所述待读取文件的元数据包括:The method according to claim 13 or 14, wherein the storage node storing the metadata of the file to be read comprises:
    所述存储节点将所述待读取文件的元数据存储于所述待读取文件的存储位置。The storage node stores the metadata of the file to be read in the storage location of the file to be read.
  18. 一种数据处理系统,包括计算节点和存储节点,其特征在于,所述计算节点执行如权利要求1至12任一权利要求所述的方法,所述存储节点执行如权利要求13至17任一权利要求所述的方法。A data processing system, comprising a computing node and a storage node, wherein the computing node executes the method according to any one of claims 1 to 12, and the storage node executes the method according to any one of claims 13 to 17.
PCT/CN2021/088588 2020-06-23 2021-04-21 Data processing method and system WO2021258831A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010581055.1A CN113835870A (en) 2020-06-23 2020-06-23 Data processing method and system
CN202010581055.1 2020-06-23

Publications (1)

Publication Number Publication Date
WO2021258831A1 true WO2021258831A1 (en) 2021-12-30

Family

ID=78964028

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/088588 WO2021258831A1 (en) 2020-06-23 2021-04-21 Data processing method and system

Country Status (2)

Country Link
CN (1) CN113835870A (en)
WO (1) WO2021258831A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117762873A (en) * 2023-12-20 2024-03-26 中邮消费金融有限公司 Data processing method, device, equipment and storage medium
WO2024103752A1 (en) * 2022-11-16 2024-05-23 工赋(青岛)科技有限公司 File transmission method, apparatus and system, electronic device, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117707588A (en) * 2022-09-09 2024-03-15 荣耀终端有限公司 Differential file restoring method and electronic equipment
CN115964353B (en) * 2023-03-10 2023-08-22 阿里巴巴(中国)有限公司 Distributed file system and access metering method thereof
CN117156172B (en) * 2023-10-30 2024-01-16 江西云眼视界科技股份有限公司 Video slice reporting method, system, storage medium and computer

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202152A (en) * 2016-06-23 2016-12-07 浪潮(北京)电子信息产业有限公司 The data processing method of a kind of cloud platform and system
US20180137172A1 (en) * 2016-11-17 2018-05-17 Sap Se Document Store with Non-Uniform Memory Access Aware High Performance Query Processing
CN109710572A (en) * 2018-12-29 2019-05-03 北京赛思信安技术股份有限公司 A kind of file sharding method based on HBase


Also Published As

Publication number Publication date
CN113835870A (en) 2021-12-24

Similar Documents

Publication Publication Date Title
WO2021258831A1 (en) Data processing method and system
CN107105009B (en) Job scheduling method and device for butting workflow engine based on Kubernetes system
US10205627B2 (en) Method and system for clustering event messages
CN110334075B (en) Data migration method based on message middleware and related equipment
US10120928B2 (en) Method and system for clustering event messages and managing event-message clusters
WO2021051627A1 (en) Database-based batch importing method, apparatus and device, and storage medium
CN110308984B (en) Cross-cluster computing system for processing geographically distributed data
CN111930489B (en) Task scheduling method, device, equipment and storage medium
US11409711B2 (en) Barriers for dependent operations among sharded data stores
CN115114370B (en) Master-slave database synchronization method and device, electronic equipment and storage medium
US11194522B2 (en) Networked shuffle storage
US9384086B1 (en) I/O operation-level error checking
CN112988884B (en) Big data platform data storage method and device
US11625192B2 (en) Peer storage compute sharing using memory buffer
US11951999B2 (en) Control unit for vehicle and error management method thereof
US11656972B1 (en) Paginating results obtained from separate programmatic interfaces
CN114020525A (en) Fault isolation method, device, equipment and storage medium
CN114547199A (en) Database increment synchronous response method and device and computer readable storage medium
CN115793957A (en) Method and device for writing data and computer storage medium
CN116594551A (en) Data storage method and device
CN113407562A (en) Communication method and device of distributed database system
CN102253940B (en) Method and device for processing data by tree view
CN113536075B (en) Data extraction method, device and storage medium
CN117435367B (en) User behavior processing method, device, equipment, storage medium and program product
CN111258748B (en) Distributed file system and control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21829501; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21829501; Country of ref document: EP; Kind code of ref document: A1)