CN111124685A - Big data processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN111124685A CN111124685A CN201911368393.0A CN201911368393A CN111124685A CN 111124685 A CN111124685 A CN 111124685A CN 201911368393 A CN201911368393 A CN 201911368393A CN 111124685 A CN111124685 A CN 111124685A
- Authority
- CN
- China
- Prior art keywords
- data
- sub
- main process
- processing
- processes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application provides a big data processing method and device, an electronic device and a storage medium, relating to the technical field of big data processing. The big data processing method is based on Python and includes: obtaining a data processing task for a hierarchical data format (HDF) file; creating a corresponding main process and a plurality of sub-processes according to the task; scheduling the sub-processes through the main process so that each concurrently obtains and processes sample data from the HDF data; and outputting the processed data to a network training model.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a big data processing method and apparatus, an electronic device, and a storage medium.
Background
At present, when Python is used to train and predict a deep learning model, data must be read from the file system into memory, processed online, and finally fed into the deep learning framework. Data access concerns the storage format of the data on the hard disk, i.e. the interface through which the data is read; online processing means that, after the data is fetched into memory, some processing is applied before the data is sent to the deep learning network.
In the prior art, all training data is written into an HDF5 file, and multiple threads are usually started while the program runs to dynamically read data from the HDF5 file, process it, and send it to the deep learning network model. The advantage of this approach is that, with the HDF5 technology, massive data can be read efficiently, the data can be compressed, and operation is simple.
However, in the prior art, HDF5 does not support multi-process reading and writing, only multi-threaded reading and writing; when a complex data processing step is encountered, Python's global interpreter lock (GIL) becomes the bottleneck of the whole pipeline, resulting in the technical problem of low big data processing efficiency.
Disclosure of Invention
The application provides a big data processing method, a big data processing apparatus, an electronic device and a storage medium, and aims to solve the following technical problems of the prior art: HDF5 does not support multi-process reading and writing, only multi-threaded reading and writing, and when a complex data processing step is encountered, Python's global interpreter lock becomes the bottleneck of the whole pipeline, making big data processing inefficient.
In order to achieve the above object, in a first aspect, an embodiment of the present application provides a big data processing method, which includes, based on python:
acquiring a data processing task of hierarchical data format (HDF) data;
creating a corresponding main process and a plurality of sub processes according to the data processing task;
and scheduling a plurality of subprocesses through the main process to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to a network training model.
Optionally, the big data processing method further includes:
before a plurality of subprocesses are scheduled by a main process to respectively acquire sample data of HDF data and carry out concurrent processing, a communication queue for the communication between the main process and the plurality of subprocesses is generated, wherein the communication queue comprises: a data request queue and a data output queue.
Optionally, before the main process schedules the multiple sub-processes to respectively obtain sample data of the HDF data for concurrent processing and outputs the processed data to the network training model, the method further includes:
and storing the indexes of the sample data to be processed of each sub-process to a data request queue through the main process.
Optionally, the scheduling, by the main process, the multiple sub-processes to respectively obtain sample data of the HDF data for concurrent processing, and outputting the processed data to the network training model, includes:
scheduling each subprocess to read the index from the data request queue through the main process;
and respectively reading sample data to be processed of the HDF data of the hierarchical data file according to the indexes through the sub-process, and simultaneously processing the sample data to be processed according to a preset data processing rule.
In a second aspect, the present application also provides a big data processing apparatus, which includes, based on python:
the acquisition module is used for acquiring a data processing task of the HDF data of the hierarchical data file;
the process module is used for creating a corresponding main process and a plurality of sub processes according to the data processing task;
and the concurrent processing module is used for scheduling the plurality of sub-processes through the main process to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to the network training model.
Optionally, the apparatus further comprises:
the communication module is used for generating a communication queue for communication between the main process and the plurality of sub-processes, wherein the communication queue comprises: a data request queue and a data output queue.
Optionally, the apparatus further comprises:
and the storage module is used for storing, through the main process, the index of the sample data to be processed of each sub-process to the data request queue before the main process schedules the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing and outputs the processed data to the network training model.
Optionally, the concurrent processing module includes:
the reading module is used for scheduling each subprocess through the main process to read the index from the data request queue and reading sample data from the HDF data of the hierarchical data file;
and the processing module is used for respectively reading the sample data to be processed of the HDF data of the hierarchical data file according to the indexes through each sub-process and simultaneously processing the sample data to be processed according to the preset data processing rule.
In a third aspect, the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and when the computer program in the memory is read and executed by the processor, the electronic device implements the method in the first aspect.
In a fourth aspect, the present application also proposes a computer-readable storage medium, on which a computer program is stored, which, when read and executed by a processor, implements the method of the first aspect.
Compared with the prior art, the method has the following beneficial effects:
the embodiment of the application provides a big data processing method and a big data processing device, wherein a main process schedules a plurality of subprocesses to respectively obtain sample data of HDF data for concurrent processing, the processed data is output to a network training model, the plurality of subprocesses run simultaneously, and each process has a lock and is not interfered with each other when running, so that the global interpreter lock of the main process is not required to be used in a competitive mode, the architecture can maximally utilize the resources of a multi-core CPU, and the data processing access and processing speed are improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to explain the technical solutions of the present application more clearly, the drawings needed for the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic flowchart illustrating a big data processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an architecture of a big data processing method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating another big data processing method according to an embodiment of the present application;
FIG. 4 is a functional block diagram of a big data processing apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of another embodiment of a big data processing apparatus;
FIG. 6 is a block diagram of another embodiment of a big data processing apparatus;
fig. 7 is a functional block diagram of a concurrency processing module according to an embodiment of the present application;
fig. 8 is a functional module schematic diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow diagram of a big data processing method according to an embodiment of the present application, and fig. 2 is a schematic diagram of the corresponding architecture. As shown in fig. 1 and fig. 2, the big data processing method may be applied to computers, intelligent terminals, servers and other devices. The method of this embodiment is based on Python and includes:
and step S101, acquiring a data processing task of the HDF data of the hierarchical data file.
Specifically, when a deep learning model is trained and predicted with Python, data must be read from the file system into memory for online processing. Because HDF data can be compressed, it reduces file-system storage; for example, medical image samples, geographic image samples and life-science image samples awaiting processing can be stored as HDF5 data to save space. A data processing task for the HDF5 data is then obtained, for example resampling or label conversion of the image samples. It should be noted that although this embodiment describes HDF5 data in detail, the method is not limited to it.
And step S102, creating a corresponding main process and a plurality of sub processes according to the data processing task.
Specifically, a corresponding main process and a plurality of sub processes can be created according to the configuration of the user and the data processing task.
And step S103, scheduling a plurality of sub-processes through the main process to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to a network training model.
In particular, although HDF5 can read and write the same file from multiple threads, Python threads cannot fully exploit a multi-core CPU because of the global interpreter lock. A multi-process scheme is therefore adopted: the main process schedules a plurality of sub-processes to each obtain sample data from the HDF5 data. Since the N sub-processes run simultaneously and each process holds its own lock, the processes do not interfere with one another at run time and need not compete for the main process's global interpreter lock. The processed data is then output to the network training model. This architecture makes maximal use of a multi-core CPU's resources: there is no global-interpreter-lock restriction between processes and no resource contention among them. For example, a server with 10 cores can start 10 processes, each running independently, driving CPU utilization to 100% and achieving a true multi-threading effect.
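The thread-versus-process contrast described above can be sketched with the standard library; `cpu_bound` below is an arbitrary stand-in for CPU-intensive processing, not code from the application:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Arbitrary CPU-bound work: threads serialize on the global
    # interpreter lock here, while worker processes each have their
    # own interpreter and run on separate cores in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_with(executor_cls, jobs, workers=4):
    # Same call shape for both executors; only the scaling differs.
    with executor_cls(max_workers=workers) as ex:
        return list(ex.map(cpu_bound, jobs))
```

Both executors return identical results; for CPU-bound work only the process pool scales with the core count, which is the effect the table in the application example quantifies.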
The embodiment of the application thus provides a big data processing method: a data processing task for hierarchical data format (HDF) data is obtained; a corresponding main process and a plurality of sub-processes are created according to the task; the main process schedules the sub-processes to each obtain sample data of the HDF data for concurrent processing; and the processed data is output to a network training model while the sub-processes run simultaneously.
Optionally, before the main process schedules the multiple sub-processes to respectively obtain sample data of the HDF data and perform concurrent processing, the method further includes:
generating a communication queue for communication between a main process and a plurality of sub-processes, wherein the communication queue comprises: a data request queue and a data output queue.
Specifically, after the main program starts, because inter-process communication is required, it first generates the two queues used for communication between the processes: a data request queue and a data output queue. The data request queue stores the indexes of the data the main process requests from the sub-processes: the main process puts the indexes of the required data into the queue, and each sub-process reads from it the indexes of the data it should process. The program running in the main process's space, on the one hand, writes the indexes of the data to be obtained into the data request queue and, on the other hand, takes the processed data out of the data output queue and sends it to the network model for training or inference.
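A minimal sketch of the two-queue scheme, using Python's standard `multiprocessing` module. The HDF5 read is replaced by an in-memory list, since the application does not fix a particular HDF5 binding, and the doubling step stands in for real processing; both substitutions are assumptions for the sake of a runnable example:

```python
import multiprocessing as mp

def worker(request_q, output_q, data):
    # Each sub-process runs in its own interpreter with its own lock,
    # so the processing below runs truly in parallel across cores.
    while True:
        idx = request_q.get()
        if idx is None:                   # sentinel: no more requests
            break
        sample = data[idx]                # stand-in for an HDF5 read
        output_q.put((idx, sample * 2))   # stand-in for real processing

def run(num_workers=4):
    ctx = mp.get_context("fork")          # assume a POSIX host
    data = list(range(100))               # stand-in for HDF5 sample data
    request_q = ctx.Queue()               # data request queue
    output_q = ctx.Queue()                # data output queue
    procs = [ctx.Process(target=worker, args=(request_q, output_q, data))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    for idx in range(len(data)):          # main process enqueues indexes
        request_q.put(idx)
    for _ in procs:                       # one sentinel per worker
        request_q.put(None)
    results = dict(output_q.get() for _ in range(len(data)))
    for p in procs:
        p.join()
    return results
```

The main process drains the output queue before joining the workers, which avoids the classic deadlock where a child blocks on a full queue while the parent waits in `join()`.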
Optionally, before the main process schedules the multiple sub-processes to respectively obtain sample data of the HDF data for concurrent processing and outputs the processed data to the network training model, the method includes:
and storing the indexes of the sample data to be processed of each sub-process to a data request queue through the main process.
Specifically, in order to match the data processing task with each sub-process, an index of sample data to be processed of each sub-process is stored in the data request queue through the main process, wherein the index indicates a position of a sample in the HDF data, and each sub-process can read the corresponding sample data to be processed through the position.
Optionally, fig. 3 is a schematic flow diagram of another big data processing method provided in an embodiment of the present application. As shown in fig. 3, scheduling the plurality of sub-processes through the main process to respectively obtain sample data of the HDF data for concurrent processing, and outputting the processed data to the network training model, includes:
step S201, each subprocess is scheduled by the main process to read indexes from the data request queue;
step S202, respectively reading sample data to be processed of the HDF data of the hierarchical data file according to the indexes through each sub-process, and simultaneously processing the sample data to be processed according to a preset data processing rule.
Specifically, each sub-process first opens the HDF5 file and then loops over the indexes in the data request queue. After reading an index, the sub-process first obtains the raw data from the HDF5 file and then passes it, in order, through the user-defined data processing pipeline. Once all processing is complete, the sub-process puts the data into the data output queue, from which the processed data is output to the network training model, and then reads the next data index, repeating the loop.
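The user-defined data processing pipeline that each sub-process applies can be modelled as an ordered list of callables. A minimal sketch follows; the two concrete stages (value clipping and scaling) are illustrative assumptions, not steps fixed by the application:

```python
def apply_pipeline(sample, pipeline):
    # Pass the raw sample through each user-defined stage in order.
    for stage in pipeline:
        sample = stage(sample)
    return sample

# Illustrative stages (assumed, not prescribed by the application):
def clip(lo, hi):
    # Clamp every value into [lo, hi].
    return lambda xs: [min(max(x, lo), hi) for x in xs]

def scale(factor):
    # Multiply every value by a constant factor.
    return lambda xs: [x * factor for x in xs]

pipeline = [clip(0, 10), scale(0.5)]
```

Keeping the pipeline as data makes it easy for the user to reorder or extend the stages without touching the sub-process loop itself.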
It should be noted that, under the Linux operating system, a child process inherits the file descriptor table of its parent, so the HDF5 file must not be opened in the parent process; otherwise the child processes cannot read the HDF5 file correctly. When a child process is started, only the HDF5 file path is passed in, and each process then opens the HDF5 file itself.
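The open-in-the-child rule can be demonstrated with an ordinary file, since the principle does not depend on HDF5 itself: the parent passes only the path, and each child opens its own handle. With h5py or PyTables the pattern would be the same, passing the `.h5` path into the child and opening the file there; the file contents below are illustrative:

```python
import multiprocessing as mp
import os
import tempfile

def child_reader(path, result_q):
    # Open the file inside the child process, never in the parent,
    # so each process holds its own independent file handle.
    with open(path) as f:
        result_q.put((os.getpid(), f.read()))

def read_in_children(path, n_children=3):
    ctx = mp.get_context("fork")  # assume a POSIX host
    q = ctx.Queue()
    procs = [ctx.Process(target=child_reader, args=(path, q))
             for _ in range(n_children)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]
    for p in procs:
        p.join()
    return results
```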
Application examples
For the above method, the present embodiment will be described by test examples as follows:
the testing hardware configuration may be an eosin X780-G30 server, CPU 40 core Intel (R) Xeon (R) Silver4114CPU @2.20GHz, 128G DDR4 memory, hard disk 16T SATA3(6Gbps), GPU 8 block NVIDIA PCI-ETesla V100.
The test environment is Ubuntu 16.04 with Python 3.6, the HDF5 read/write library tables (PyTables) 2.4, and numpy 1.16.
The test data consists of kidney-cancer CT (computed tomography) slices and labels from 210 cases. The CT volumes and labels are three-dimensional numpy arrays, typically of size 512x512x100. A compression algorithm is used in the HDF5 file to reduce file-system storage, and the final HDF5 file size is 18 GB.
The test procedure randomly reads all 210 samples from the HDF5 file and then processes them on the CPU.
CPU non-intensive processing: resample to an array of size (128,128,128), then convert the multi-valued label to a binary label (i.e., add one channel).
CPU-intensive processing: first resample to an array of size (128,128,128), then traverse all array elements, and finally convert to binary labels.
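The two steps can be sketched with numpy; nearest-neighbour index mapping is an assumed interpolation scheme, since the embodiment says "resample" without fixing one, and the binarization threshold (label > 0) is likewise an assumption:

```python
import numpy as np

def resample_nearest(vol, shape=(128, 128, 128)):
    # Nearest-neighbour resampling via index mapping: pick, along each
    # axis, the source indices closest to an even grid of the target size.
    idx = [np.round(np.linspace(0, s - 1, t)).astype(int)
           for s, t in zip(vol.shape, shape)]
    return vol[np.ix_(*idx)]

def binarize_label(label):
    # Collapse a multi-valued label volume to a binary mask
    # (assumed rule: any non-zero value counts as foreground).
    return (label > 0).astype(np.uint8)
```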
In addition to the above tests, loading data from HDF5 was compared with loading sample files in .npy format directly with the numpy library.
Both the multi-process and multi-thread configurations use queues to reduce latency; the number of processes or threads is 10 and the queue size is 20.
The tests are divided into several groups of cases according to the runtime configuration (thread vs. process, CPU-intensive vs. non-intensive, etc.). Table 1 below shows the results for the big data processing method provided by this embodiment:
TABLE 1
| No. | Processing mode | CPU non-intensive | CPU-intensive |
|---|---|---|---|
| 1 | Single process (HDF5) | 222 seconds | 1610 seconds |
| 2 | Multi-thread (N=10, HDF5) | 342 seconds | 1584 seconds |
| 3 | Multi-process (N=10, HDF5) | 56 seconds | 221 seconds |
| 4 | Multi-process (N=10, npy) | 889 seconds | n/a |
As can be seen from table 1 above, multithreading offers no advantage in a Python environment, loading data with HDF5 is faster than the other methods, and combining multiple processes with HDF5 greatly improves processing speed.
Fig. 4 is a functional module schematic diagram of a big data processing apparatus provided in an embodiment of the present application. Referring to fig. 4, it should be noted that the basic principles and technical effects of the big data processing apparatus 300 of this embodiment are the same as those of the corresponding method embodiment; for brevity, refer to the corresponding content of the method embodiment for anything not mentioned here. The big data processing apparatus 300 includes:
the acquisition module 310 is configured to acquire a data processing task of HDF data of the hierarchical data file;
a process module 320, configured to create a corresponding main process and multiple sub processes according to the data processing task;
and the concurrent processing module 330 is configured to schedule a plurality of sub-processes through the main process to respectively obtain sample data of the HDF data for concurrent processing, and output the processed data to the network training model.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Optionally, fig. 5 is a schematic functional module diagram of another big data processing apparatus according to an embodiment of the present application, please refer to fig. 5, in which the big data processing apparatus 300 further includes:
a communication module 340, configured to generate a communication queue for a main process and a plurality of sub-processes to communicate with each other, where the communication queue includes: a data request queue and a data output queue.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Optionally, fig. 6 is a schematic functional module diagram of another big data processing apparatus according to an embodiment of the present application, please refer to fig. 6, where the big data processing apparatus 300 further includes:
and the storage module 350 is configured to schedule the multiple sub-processes through the main process to respectively obtain sample data of the HDF data for concurrent processing, and store an index of the sample data to be processed of each sub-process to the data request queue through the main process before outputting the processed data to the network training model.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Optionally, fig. 7 is a functional module schematic diagram of a concurrency processing module according to an embodiment of the present application, please refer to fig. 7, where the concurrency processing module 330 includes:
the reading module 331, configured to schedule each sub-process to read an index from the data request queue through the main process;
and the processing module 332 is configured to read, by each sub-process, sample data to be processed of the HDF data of the hierarchical data file according to the index, and process the sample data to be processed simultaneously according to a preset data processing rule.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These modules may be implemented as one or more integrated circuits configured to carry out the above methods, for example one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). Alternatively, when a module is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor such as a Central Processing Unit (CPU) or another processor capable of calling program code. These modules may also be integrated together and implemented as a system-on-a-chip (SoC).
Fig. 8 is a functional module diagram of an electronic device provided in an embodiment of the present application. Referring to fig. 8, the electronic device may include a processor 1001 and a memory 1002, and the processor 1001 may call a computer program stored in the memory 1002. When the computer program is read and executed by the processor 1001, the above method embodiments may be implemented. The specific implementation and technical effects are similar and are not described again here.
Optionally, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is read and executed by a processor, the above method embodiments can be implemented.
In the several embodiments provided in the present application, it should be understood that the apparatus embodiments described above are merely illustrative, and the disclosed apparatus and method may be implemented in other ways. For example, the division into units is only a logical functional division; in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Each unit may be integrated into one processing unit, may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A big data processing method, based on Python, characterized in that the method comprises the following steps:
acquiring a data processing task for hierarchical data file (HDF) data;
creating a corresponding main process and a plurality of sub-processes according to the data processing task;
and scheduling, by the main process, the plurality of sub-processes to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to a network training model.
2. The method according to claim 1, wherein before the main process schedules the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing, the method further comprises:
generating a communication queue for the main process and the plurality of sub-processes to communicate with each other, wherein the communication queue comprises: a data request queue and a data output queue.
3. The method according to claim 2, wherein before the scheduling, by the main process, of the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing and the outputting of the processed data to a network training model, the method further comprises:
storing, by the main process, the index of the sample data to be processed by each sub-process to the data request queue.
4. The method according to claim 3, wherein the scheduling, by the main process, of the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing, and the outputting of the processed data to a network training model, comprise:
scheduling, by the main process, each of the sub-processes to read the index from the data request queue;
and reading, by each sub-process, the sample data to be processed of the hierarchical data file (HDF) data according to the index, while processing the sample data to be processed according to a preset data processing rule.
5. A big data processing apparatus, based on Python, characterized by comprising:
an acquisition module, configured to acquire a data processing task for hierarchical data file (HDF) data;
a process module, configured to create a corresponding main process and a plurality of sub-processes according to the data processing task;
and a concurrent processing module, configured to schedule, through the main process, the plurality of sub-processes to respectively acquire sample data of the HDF data for concurrent processing, and to output the processed data to a network training model.
6. The apparatus of claim 5, further comprising:
a communication module, configured to generate a communication queue for the main process and the plurality of sub-processes to communicate with each other, where the communication queue includes: a data request queue and a data output queue.
7. The apparatus of claim 6, further comprising:
and a storage module, configured to store, through the main process, the index of the sample data to be processed by each sub-process to the data request queue, before the main process schedules the plurality of sub-processes to respectively acquire the sample data of the HDF data for concurrent processing and outputs the processed data to a network training model.
8. The apparatus of claim 7, wherein the concurrent processing module comprises:
a reading module, configured to schedule, through the main process, each of the sub-processes to read the index from the data request queue;
and a processing module, configured to read, through each sub-process, the sample data to be processed of the hierarchical data file (HDF) data according to the index, while processing the sample data to be processed according to a preset data processing rule.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the steps of the big data processing method according to any of claims 1 to 4.
10. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, performs the steps of the big data processing method according to any of claims 1 to 4.
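For illustration only, the scheme recited in claims 1-4 — a main process that hands sample indices to worker sub-processes through a data request queue, with processed results returned on a data output queue — can be sketched in Python roughly as follows. This is a minimal, hypothetical sketch under stated assumptions, not the patented implementation: `load_sample` stands in for reading one sample from a hierarchical data file (a real loader might slice an h5py dataset), and `preprocess` stands in for the "preset data processing rule" (e.g. resampling).

```python
# Hypothetical sketch of a main process scheduling worker sub-processes
# via a data request queue and a data output queue.
import multiprocessing as mp

def load_sample(index):
    # Stand-in for reading one sample of HDF data by index;
    # a real loader would open the file (e.g. with h5py) and slice a dataset.
    return [index, index + 1, index + 2]

def preprocess(sample):
    # Stand-in for the preset data processing rule.
    return [x * 2 for x in sample]

def worker(request_q, output_q):
    # Each sub-process reads indices from the request queue, loads and
    # processes the corresponding sample, and puts the result on the
    # output queue; a None index is a shutdown sentinel.
    while True:
        index = request_q.get()
        if index is None:
            break
        output_q.put((index, preprocess(load_sample(index))))

def run(indices, n_workers=2):
    request_q, output_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(request_q, output_q))
               for _ in range(n_workers)]
    for p in workers:
        p.start()
    for i in indices:          # main process stores sample indices to the request queue
        request_q.put(i)
    for _ in workers:          # one sentinel per worker
        request_q.put(None)
    # Drain results before joining, so no worker blocks on a full queue.
    results = dict(output_q.get() for _ in indices)
    for p in workers:
        p.join()
    return results             # processed data, ready to feed a training loop

if __name__ == "__main__":
    print(run([0, 1, 2])[1])
```

The request/output queue pair decouples scheduling from processing: the main process never touches the file, so the concurrent reads happen entirely in the sub-processes, which is the stated point of the claimed method.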
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911368393.0A CN111124685A (en) | 2019-12-26 | 2019-12-26 | Big data processing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111124685A true CN111124685A (en) | 2020-05-08 |
Family
ID=70503185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911368393.0A Pending CN111124685A (en) | 2019-12-26 | 2019-12-26 | Big data processing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111124685A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050039049A1 (en) * | 2003-08-14 | 2005-02-17 | International Business Machines Corporation | Method and apparatus for a multiple concurrent writer file system |
CN103744643A (en) * | 2014-01-10 | 2014-04-23 | 浪潮(北京)电子信息产业有限公司 | Method and device for structuring a plurality of nodes parallel under multithreaded program |
CN104268229A (en) * | 2014-09-26 | 2015-01-07 | 北京金山安全软件有限公司 | Resource obtaining method and device based on multi-process browser |
CN105337755A (en) * | 2014-08-08 | 2016-02-17 | 阿里巴巴集团控股有限公司 | Master-slave architecture server, service processing method thereof and service processing system thereof |
CN109144741A (en) * | 2017-06-13 | 2019-01-04 | 广东神马搜索科技有限公司 | The method, apparatus and electronic equipment of interprocess communication |
CN109922319A (en) * | 2019-03-26 | 2019-06-21 | 重庆英卡电子有限公司 | RTSP agreement multiple video strems Parallel preconditioning method based on multi-core CPU |
CN110162452A (en) * | 2019-04-30 | 2019-08-23 | 广州微算互联信息技术有限公司 | A kind of analog detection method, device and storage medium for testing and control service |
CN110413386A (en) * | 2019-06-27 | 2019-11-05 | 深圳市富途网络科技有限公司 | Multiprocessing method, apparatus, terminal device and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
郭学兵 (Guo Xuebing): "Application of Python-based parallel programming technology in the batch database loading of standard meteorological reports" * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112346879A (en) * | 2020-11-06 | 2021-02-09 | 网易(杭州)网络有限公司 | Process management method and device, computer equipment and storage medium |
CN112346879B (en) * | 2020-11-06 | 2023-08-11 | 网易(杭州)网络有限公司 | Process management method, device, computer equipment and storage medium |
CN113326139A (en) * | 2021-06-28 | 2021-08-31 | 上海商汤科技开发有限公司 | Task processing method, device, equipment and storage medium |
CN113391909A (en) * | 2021-06-28 | 2021-09-14 | 上海商汤科技开发有限公司 | Process creation method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200293360A1 (en) | Techniques to manage virtual classes for statistical tests | |
Manolache et al. | Schedulability analysis of applications with stochastic task execution times | |
CN111258744A (en) | Task processing method based on heterogeneous computation and software and hardware framework system | |
CN111124685A (en) | Big data processing method and device, electronic equipment and storage medium | |
US20080209436A1 (en) | Automated testing of programs using race-detection and flipping | |
US20130061231A1 (en) | Configurable computing architecture | |
CN111190741A (en) | Scheduling method, device and storage medium based on deep learning node calculation | |
Gu et al. | Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms | |
Hong et al. | Hierarchical dataflow modeling of iterative applications | |
CN114924748A (en) | Compiling method, device and equipment | |
CN113407343A (en) | Service processing method, device and equipment based on resource allocation | |
CN116521350B (en) | ETL scheduling method and device based on deep learning algorithm | |
US11531565B2 (en) | Techniques to generate execution schedules from neural network computation graphs | |
KR20150101870A (en) | Method and apparatus for avoiding bank conflict in memory | |
Elshazly et al. | Accelerated execution via eager-release of dependencies in task-based workflows | |
CN113760497A (en) | Scheduling task configuration method and device | |
CN114127681A (en) | Method and apparatus for enabling autonomous acceleration of data flow AI applications | |
Batko et al. | Actor model of Anemone functional language | |
Delestrac et al. | Demystifying the TensorFlow Eager Execution of Deep Learning Inference on a CPU-GPU Tandem | |
Beach et al. | Integrating acceleration devices using CometCloud | |
McDonagh et al. | Applying semi-synchronised task farming to large-scale computer vision problems | |
US20230273818A1 (en) | Highly parallel processing architecture with out-of-order resolution | |
CN117742928B (en) | Algorithm component execution scheduling method for federal learning | |
WO2024109312A1 (en) | Task scheduling execution method, and generation method and apparatus for task scheduling execution instruction | |
Friebe et al. | Work-in-Progress: Validation of Probabilistic Timing Models of a Periodic Task with Interference-A Case Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200508 |