CN111124685A - Big data processing method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN111124685A CN111124685A CN201911368393.0A CN201911368393A CN111124685A CN 111124685 A CN111124685 A CN 111124685A CN 201911368393 A CN201911368393 A CN 201911368393A CN 111124685 A CN111124685 A CN 111124685A
- Authority
- CN
- China
- Prior art keywords
- data
- sub
- main process
- processing
- processes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/54—Interprogram communication
- G06F9/546—Message passing systems or structures, e.g. queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application provides a big data processing method and device, an electronic device and a storage medium, relating to the technical field of big data processing. The big data processing method is based on Python and includes: obtaining a data processing task for a hierarchical data format (HDF) file; creating a corresponding main process and a plurality of sub-processes according to the task; scheduling the sub-processes through the main process so that each concurrently obtains and processes sample data from the HDF data; and outputting the processed data to a network training model.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a big data processing method and apparatus, an electronic device, and a storage medium.
Background
At present, when Python is used to train and predict a deep learning model, data must be read from the file system into memory, processed online, and finally fed into the deep learning framework. Data access concerns the storage format of the data on the hard disk, i.e. the interface through which the data is read; online processing means that, after the data is fetched into memory, some processing is applied before the data is sent to the deep learning network.
In the prior art, all training data is written into an HDF5 file, and multiple threads are usually started while the program runs to dynamically read data from the HDF5 file, process it, and send it to the deep learning network model. The advantage of this approach is that, with the HDF5 technology, massive data can be read efficiently, the data can be compressed, and operation is simple.
However, in the prior art, HDF5 does not support multi-process reading and writing, only multi-threaded reading and writing; when a complex data processing step is encountered, Python's global interpreter lock (GIL) becomes the bottleneck of the whole pipeline, resulting in the technical problem of low big data processing efficiency.
Disclosure of Invention
The application provides a big data processing method, a big data processing apparatus, an electronic device and a storage medium, and aims to solve the following technical problems of the prior art: HDF5 does not support multi-process reading and writing, only multi-threaded reading and writing, and when a complex data processing step is encountered, Python's global interpreter lock becomes the bottleneck of the whole pipeline, making big data processing inefficient.
In order to achieve the above object, in a first aspect, an embodiment of the present application provides a big data processing method, which includes, based on python:
acquiring a data processing task of hierarchical data format (HDF) data;
creating a corresponding main process and a plurality of sub processes according to the data processing task;
and scheduling a plurality of subprocesses through the main process to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to a network training model.
Optionally, the big data processing method further includes:
before a plurality of subprocesses are scheduled by a main process to respectively acquire sample data of HDF data and carry out concurrent processing, a communication queue for the communication between the main process and the plurality of subprocesses is generated, wherein the communication queue comprises: a data request queue and a data output queue.
Optionally, before the main process schedules the multiple sub-processes to respectively obtain sample data of the HDF data for concurrent processing and outputs the processed data to the network training model, the method further includes:
and storing the indexes of the sample data to be processed of each sub-process to a data request queue through the main process.
Optionally, the scheduling, by the main process, the multiple sub-processes to respectively obtain sample data of the HDF data for concurrent processing, and outputting the processed data to the network training model, includes:
scheduling each subprocess to read the index from the data request queue through the main process;
and respectively reading sample data to be processed of the HDF data of the hierarchical data file according to the indexes through the sub-process, and simultaneously processing the sample data to be processed according to a preset data processing rule.
In a second aspect, the present application also provides a big data processing apparatus, which includes, based on python:
the acquisition module is used for acquiring a data processing task of the HDF data of the hierarchical data file;
the process module is used for creating a corresponding main process and a plurality of sub processes according to the data processing task;
and the concurrent processing module is used for scheduling the plurality of sub-processes through the main process to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to the network training model.
Optionally, the apparatus further comprises:
the communication module is used for generating a communication queue for communication between the main process and the plurality of sub-processes, wherein the communication queue comprises: a data request queue and a data output queue.
Optionally, the apparatus further comprises:
and the storage module is used for storing, through the main process, the index of the sample data to be processed of each sub-process to the data request queue before the main process schedules the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing and outputs the processed data to the network training model.
Optionally, the concurrent processing module includes:
the reading module is used for scheduling each subprocess through the main process to read the index from the data request queue and reading sample data from the HDF data of the hierarchical data file;
and the processing module is used for respectively reading the sample data to be processed of the HDF data of the hierarchical data file according to the indexes through each sub-process and simultaneously processing the sample data to be processed according to the preset data processing rule.
In a third aspect, the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and when the computer program in the memory is read and executed by the processor, the electronic device implements the method in the first aspect.
In a fourth aspect, the present application also proposes a computer-readable storage medium, on which a computer program is stored, which, when read and executed by a processor, implements the method of the first aspect.
Compared with the prior art, the method has the following beneficial effects:
the embodiment of the application provides a big data processing method and a big data processing device, wherein a main process schedules a plurality of subprocesses to respectively obtain sample data of HDF data for concurrent processing, the processed data is output to a network training model, the plurality of subprocesses run simultaneously, and each process has a lock and is not interfered with each other when running, so that the global interpreter lock of the main process is not required to be used in a competitive mode, the architecture can maximally utilize the resources of a multi-core CPU, and the data processing access and processing speed are improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to explain the technical solutions of the present application more clearly, the drawings needed for the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be considered limiting of the scope; those skilled in the art can derive other related drawings from them without inventive effort.
Fig. 1 is a schematic flowchart illustrating a big data processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram illustrating an architecture of a big data processing method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart illustrating another big data processing method according to an embodiment of the present application;
FIG. 4 is a functional block diagram of a big data processing apparatus according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of another embodiment of a big data processing apparatus;
FIG. 6 is a block diagram of another embodiment of a big data processing apparatus;
fig. 7 is a functional block diagram of a concurrency processing module according to an embodiment of the present application;
fig. 8 is a functional module schematic diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow diagram of a big data processing method according to an embodiment of the present application, and fig. 2 is a schematic diagram of the corresponding architecture. As shown in fig. 1 and fig. 2, the big data processing method may be applied to computers, intelligent terminals, servers and other devices. The method of this embodiment is based on Python and includes:
and step S101, acquiring a data processing task of the HDF data of the hierarchical data file.
Specifically, when a deep learning model is trained and predicted with Python, data must be read from the file system into memory for online processing. Because HDF data can be compressed, it reduces file-system storage; for example, medical image samples, geographic image samples and life-science image samples awaiting processing can be stored as HDF5 data to save space. A data processing task for the HDF5 data is then obtained, for example resampling or label conversion of the image samples. It should be noted that although this embodiment describes HDF5 data in detail, the method is not limited to it.
And step S102, creating a corresponding main process and a plurality of sub processes according to the data processing task.
Specifically, a corresponding main process and a plurality of sub processes can be created according to the configuration of the user and the data processing task.
And step S103, scheduling a plurality of sub-processes through the main process to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to a network training model.
In particular, although HDF5 can read and write the same file from multiple threads, Python threads cannot fully exploit a multi-core CPU because of the global interpreter lock. A multi-process scheme is therefore adopted: the main process schedules a plurality of sub-processes to each obtain sample data from the HDF5 data. Since the N sub-processes run simultaneously and each process holds its own lock, the processes do not interfere with one another at run time and need not compete for the main process's global interpreter lock. The processed data is then output to the network training model. This architecture makes maximal use of a multi-core CPU's resources: there is no global-interpreter-lock restriction between processes and no resource contention among them. For example, a server with 10 cores can start 10 processes, each running independently, driving CPU utilization to 100% and achieving a true multi-threading effect.
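The thread-versus-process contrast described above can be sketched with the standard library; `cpu_bound` below is an arbitrary stand-in for CPU-intensive processing, not code from the application:

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Arbitrary CPU-bound work: threads serialize on the global
    # interpreter lock here, while worker processes each have their
    # own interpreter and run on separate cores in parallel.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_with(executor_cls, jobs, workers=4):
    # Same call shape for both executors; only the scaling differs.
    with executor_cls(max_workers=workers) as ex:
        return list(ex.map(cpu_bound, jobs))
```

Both executors return identical results; for CPU-bound work only the process pool scales with the core count, which is the effect the table in the application example quantifies.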
The embodiment of the application thus provides a big data processing method: a data processing task for hierarchical data format (HDF) data is obtained; a corresponding main process and a plurality of sub-processes are created according to the task; the main process schedules the sub-processes to each obtain sample data of the HDF data for concurrent processing; and the processed data is output to a network training model while the sub-processes run simultaneously.
Optionally, before the main process schedules the multiple sub-processes to respectively obtain sample data of the HDF data and perform concurrent processing, the method further includes:
generating a communication queue for communication between a main process and a plurality of sub-processes, wherein the communication queue comprises: a data request queue and a data output queue.
Specifically, after the main program starts, because inter-process communication is required, it first generates the two queues used for communication between the processes: a data request queue and a data output queue. The data request queue stores the indexes of the data the main process requests from the sub-processes: the main process puts the indexes of the required data into the queue, and each sub-process reads from it the indexes of the data it should process. The program running in the main process's space, on the one hand, writes the indexes of the data to be obtained into the data request queue and, on the other hand, takes the processed data out of the data output queue and sends it to the network model for training or inference.
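A minimal sketch of the two-queue scheme, using Python's standard `multiprocessing` module. The HDF5 read is replaced by an in-memory list, since the application does not fix a particular HDF5 binding, and the doubling step stands in for real processing; both substitutions are assumptions for the sake of a runnable example:

```python
import multiprocessing as mp

def worker(request_q, output_q, data):
    # Each sub-process runs in its own interpreter with its own lock,
    # so the processing below runs truly in parallel across cores.
    while True:
        idx = request_q.get()
        if idx is None:                   # sentinel: no more requests
            break
        sample = data[idx]                # stand-in for an HDF5 read
        output_q.put((idx, sample * 2))   # stand-in for real processing

def run(num_workers=4):
    ctx = mp.get_context("fork")          # assume a POSIX host
    data = list(range(100))               # stand-in for HDF5 sample data
    request_q = ctx.Queue()               # data request queue
    output_q = ctx.Queue()                # data output queue
    procs = [ctx.Process(target=worker, args=(request_q, output_q, data))
             for _ in range(num_workers)]
    for p in procs:
        p.start()
    for idx in range(len(data)):          # main process enqueues indexes
        request_q.put(idx)
    for _ in procs:                       # one sentinel per worker
        request_q.put(None)
    results = dict(output_q.get() for _ in range(len(data)))
    for p in procs:
        p.join()
    return results
```

The main process drains the output queue before joining the workers, which avoids the classic deadlock where a child blocks on a full queue while the parent waits in `join()`.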
Optionally, before the main process schedules the multiple sub-processes to respectively obtain sample data of the HDF data for concurrent processing and outputs the processed data to the network training model, the method includes:
and storing the indexes of the sample data to be processed of each sub-process to a data request queue through the main process.
Specifically, in order to match the data processing task with each sub-process, an index of sample data to be processed of each sub-process is stored in the data request queue through the main process, wherein the index indicates a position of a sample in the HDF data, and each sub-process can read the corresponding sample data to be processed through the position.
Optionally, fig. 3 is a schematic flow diagram of another big data processing method provided in an embodiment of the present application. As shown in fig. 3, scheduling the plurality of sub-processes through the main process to respectively obtain sample data of the HDF data for concurrent processing, and outputting the processed data to the network training model, includes:
step S201, each subprocess is scheduled by the main process to read indexes from the data request queue;
step S202, respectively reading sample data to be processed of the HDF data of the hierarchical data file according to the indexes through each sub-process, and simultaneously processing the sample data to be processed according to a preset data processing rule.
Specifically, each sub-process first opens the HDF5 file and then loops over the indexes in the data request queue. After reading an index, the sub-process first obtains the raw data from the HDF5 file and then passes it, in order, through the user-defined data processing pipeline. Once all processing is complete, the sub-process puts the data into the data output queue, from which the processed data is output to the network training model, and then reads the next data index, repeating the loop.
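The user-defined data processing pipeline that each sub-process applies can be modelled as an ordered list of callables. A minimal sketch follows; the two concrete stages (value clipping and scaling) are illustrative assumptions, not steps fixed by the application:

```python
def apply_pipeline(sample, pipeline):
    # Pass the raw sample through each user-defined stage in order.
    for stage in pipeline:
        sample = stage(sample)
    return sample

# Illustrative stages (assumed, not prescribed by the application):
def clip(lo, hi):
    # Clamp every value into [lo, hi].
    return lambda xs: [min(max(x, lo), hi) for x in xs]

def scale(factor):
    # Multiply every value by a constant factor.
    return lambda xs: [x * factor for x in xs]

pipeline = [clip(0, 10), scale(0.5)]
```

Keeping the pipeline as data makes it easy for the user to reorder or extend the stages without touching the sub-process loop itself.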
It should be noted that, under the Linux operating system, a child process inherits the file descriptor table of its parent, so the HDF5 file must not be opened in the parent process; otherwise the child processes cannot read the HDF5 file correctly. When a child process is started, only the HDF5 file path is passed in, and each process then opens the HDF5 file itself.
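The open-in-the-child rule can be demonstrated with an ordinary file, since the principle does not depend on HDF5 itself: the parent passes only the path, and each child opens its own handle. With h5py or PyTables the pattern would be the same, passing the `.h5` path into the child and opening the file there; the file contents below are illustrative:

```python
import multiprocessing as mp
import os
import tempfile

def child_reader(path, result_q):
    # Open the file inside the child process, never in the parent,
    # so each process holds its own independent file handle.
    with open(path) as f:
        result_q.put((os.getpid(), f.read()))

def read_in_children(path, n_children=3):
    ctx = mp.get_context("fork")  # assume a POSIX host
    q = ctx.Queue()
    procs = [ctx.Process(target=child_reader, args=(path, q))
             for _ in range(n_children)]
    for p in procs:
        p.start()
    results = [q.get() for _ in procs]
    for p in procs:
        p.join()
    return results
```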
Application examples
For the above method, the present embodiment will be described by test examples as follows:
the testing hardware configuration may be an eosin X780-G30 server, CPU 40 core Intel (R) Xeon (R) Silver4114CPU @2.20GHz, 128G DDR4 memory, hard disk 16T SATA3(6Gbps), GPU 8 block NVIDIA PCI-ETesla V100.
The test environment is Ubuntu 16.04 with Python 3.6, the HDF5 read/write library tables (PyTables) 2.4, and numpy 1.16.
The test data consists of kidney-cancer CT (computed tomography) slices and labels from 210 cases. The CT volumes and labels are three-dimensional numpy arrays, typically of size 512x512x100. A compression algorithm is used in the HDF5 file to reduce file-system storage, and the final HDF5 file size is 18 GB.
The test procedure randomly reads all 210 samples from the HDF5 file and then processes them on the CPU.
CPU non-intensive processing: resample to an array of size (128,128,128), then convert the multi-valued label to a binary label (i.e., add one channel).
CPU-intensive processing: first resample to an array of size (128,128,128), then traverse all array elements, and finally convert to binary labels.
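The two steps can be sketched with numpy; nearest-neighbour index mapping is an assumed interpolation scheme, since the embodiment says "resample" without fixing one, and the binarization threshold (label > 0) is likewise an assumption:

```python
import numpy as np

def resample_nearest(vol, shape=(128, 128, 128)):
    # Nearest-neighbour resampling via index mapping: pick, along each
    # axis, the source indices closest to an even grid of the target size.
    idx = [np.round(np.linspace(0, s - 1, t)).astype(int)
           for s, t in zip(vol.shape, shape)]
    return vol[np.ix_(*idx)]

def binarize_label(label):
    # Collapse a multi-valued label volume to a binary mask
    # (assumed rule: any non-zero value counts as foreground).
    return (label > 0).astype(np.uint8)
```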
In addition to the above tests, loading data from HDF5 was compared with loading sample files in .npy format directly with the numpy library.
Both the multi-process and multi-thread configurations use queues to reduce latency; the number of processes or threads is 10 and the queue size is 20.
The tests are divided into several groups of cases according to the runtime configuration (thread vs. process, CPU-intensive vs. non-intensive, etc.). Table 1 below shows the results for the big data processing method provided by this embodiment:
TABLE 1
| No. | Processing mode | CPU non-intensive | CPU-intensive |
|---|---|---|---|
| 1 | Single process (HDF5) | 222 seconds | 1610 seconds |
| 2 | Multi-thread (N=10, HDF5) | 342 seconds | 1584 seconds |
| 3 | Multi-process (N=10, HDF5) | 56 seconds | 221 seconds |
| 4 | Multi-process (N=10, npy) | 889 seconds | n/a |
As can be seen from table 1 above, multithreading offers no advantage in a Python environment, loading data with HDF5 is faster than the other methods, and combining multiple processes with HDF5 greatly improves processing speed.
Fig. 4 is a functional module schematic diagram of a big data processing apparatus provided in an embodiment of the present application. Referring to fig. 4, it should be noted that the basic principles and technical effects of the big data processing apparatus 300 of this embodiment are the same as those of the corresponding method embodiment; for brevity, refer to the corresponding content of the method embodiment for anything not mentioned here. The big data processing apparatus 300 includes:
the acquisition module 310 is configured to acquire a data processing task of HDF data of the hierarchical data file;
a process module 320, configured to create a corresponding main process and multiple sub processes according to the data processing task;
and the concurrent processing module 330 is configured to schedule a plurality of sub-processes through the main process to respectively obtain sample data of the HDF data for concurrent processing, and output the processed data to the network training model.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Optionally, fig. 5 is a schematic functional module diagram of another big data processing apparatus according to an embodiment of the present application, please refer to fig. 5, in which the big data processing apparatus 300 further includes:
a communication module 340, configured to generate a communication queue for a main process and a plurality of sub-processes to communicate with each other, where the communication queue includes: a data request queue and a data output queue.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Optionally, fig. 6 is a schematic functional module diagram of another big data processing apparatus according to an embodiment of the present application, please refer to fig. 6, where the big data processing apparatus 300 further includes:
and the storage module 350 is configured to schedule the multiple sub-processes through the main process to respectively obtain sample data of the HDF data for concurrent processing, and store an index of the sample data to be processed of each sub-process to the data request queue through the main process before outputting the processed data to the network training model.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
Optionally, fig. 7 is a functional module schematic diagram of a concurrency processing module according to an embodiment of the present application, please refer to fig. 7, where the concurrency processing module 330 includes:
the reading module 331, configured to schedule each sub-process to read an index from the data request queue through the main process;
and the processing module 332 is configured to read, by each sub-process, sample data to be processed of the HDF data of the hierarchical data file according to the index, and process the sample data to be processed simultaneously according to a preset data processing rule.
The above-mentioned apparatus is used for executing the method provided by the foregoing embodiment, and the implementation principle and technical effect are similar, which are not described herein again.
These modules may be implemented as one or more integrated circuits configured to carry out the above methods, for example one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). Alternatively, when a module is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor such as a Central Processing Unit (CPU) or another processor capable of calling program code. These modules may also be integrated together and implemented as a system-on-a-chip (SoC).
Fig. 8 is a functional module diagram of an electronic device provided in an embodiment of the present application. Referring to fig. 8, the electronic device may include a processor 1001 and a memory 1002, and the processor 1001 may call a computer program stored in the memory 1002. When the computer program is read and executed by the processor 1001, the above method embodiments may be implemented. The specific implementation and technical effects are similar and are not described again here.
Optionally, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is read and executed by a processor, the above method embodiments can be implemented.
In the several embodiments provided in the present application, it should be understood that the apparatus embodiments described above are merely illustrative, and the disclosed apparatus and method may be implemented in other ways. For example, the division into units is only a logical functional division; in actual implementation there may be other divisions: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Each unit may be integrated into one processing unit, may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A big data processing method, based on Python, characterized in that the method comprises the following steps:
acquiring a data processing task for hierarchical data file (HDF) data;
creating a corresponding main process and a plurality of sub-processes according to the data processing task;
and scheduling, by the main process, the plurality of sub-processes to respectively acquire sample data of the HDF data for concurrent processing, and outputting the processed data to a network training model.
2. The method according to claim 1, wherein before the main process schedules the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing, the method further comprises:
generating a communication queue for the main process and the plurality of sub-processes to communicate with each other, wherein the communication queue comprises: a data request queue and a data output queue.
3. The method according to claim 2, wherein before the scheduling, by the main process, of the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing and the outputting of the processed data to a network training model, the method further comprises:
storing, by the main process, the index of the sample data to be processed by each sub-process to the data request queue.
4. The method according to claim 3, wherein the scheduling, by the main process, of the plurality of sub-processes to respectively obtain sample data of the HDF data for concurrent processing, and the outputting of the processed data to a network training model, comprise:
scheduling, by the main process, each of the sub-processes to read the index from the data request queue;
and reading, by each sub-process, the sample data to be processed of the hierarchical data file (HDF) data according to the index, while processing the sample data to be processed according to a preset data processing rule.
5. A big data processing apparatus, based on Python, characterized by comprising:
an acquisition module, configured to acquire a data processing task for hierarchical data file (HDF) data;
a process module, configured to create a corresponding main process and a plurality of sub-processes according to the data processing task;
and a concurrent processing module, configured to schedule, through the main process, the plurality of sub-processes to respectively acquire sample data of the HDF data for concurrent processing, and to output the processed data to a network training model.
6. The apparatus of claim 5, further comprising:
a communication module, configured to generate a communication queue for the main process and the plurality of sub-processes to communicate with each other, where the communication queue includes: a data request queue and a data output queue.
7. The apparatus of claim 6, further comprising:
and a storage module, configured to store, through the main process, the index of the sample data to be processed by each sub-process to the data request queue, before the main process schedules the plurality of sub-processes to respectively acquire the sample data of the HDF data for concurrent processing and outputs the processed data to a network training model.
8. The apparatus of claim 7, wherein the concurrent processing module comprises:
a reading module, configured to schedule, through the main process, each of the sub-processes to read the index from the data request queue;
and a processing module, configured to read, through each sub-process, the sample data to be processed of the hierarchical data file (HDF) data according to the index, while processing the sample data to be processed according to a preset data processing rule.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and the processor is configured to execute the computer program to perform the steps of the big data processing method according to any of claims 1 to 4.
10. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, performs the steps of the big data processing method according to any of claims 1 to 4.
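For illustration only, the scheme recited in claims 1-4 — a main process that hands sample indices to worker sub-processes through a data request queue, with processed results returned on a data output queue — can be sketched in Python roughly as follows. This is a minimal, hypothetical sketch under stated assumptions, not the patented implementation: `load_sample` stands in for reading one sample from a hierarchical data file (a real loader might slice an h5py dataset), and `preprocess` stands in for the "preset data processing rule" (e.g. resampling).

```python
# Hypothetical sketch of a main process scheduling worker sub-processes
# via a data request queue and a data output queue.
import multiprocessing as mp

def load_sample(index):
    # Stand-in for reading one sample of HDF data by index;
    # a real loader would open the file (e.g. with h5py) and slice a dataset.
    return [index, index + 1, index + 2]

def preprocess(sample):
    # Stand-in for the preset data processing rule.
    return [x * 2 for x in sample]

def worker(request_q, output_q):
    # Each sub-process reads indices from the request queue, loads and
    # processes the corresponding sample, and puts the result on the
    # output queue; a None index is a shutdown sentinel.
    while True:
        index = request_q.get()
        if index is None:
            break
        output_q.put((index, preprocess(load_sample(index))))

def run(indices, n_workers=2):
    request_q, output_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(request_q, output_q))
               for _ in range(n_workers)]
    for p in workers:
        p.start()
    for i in indices:          # main process stores sample indices to the request queue
        request_q.put(i)
    for _ in workers:          # one sentinel per worker
        request_q.put(None)
    # Drain results before joining, so no worker blocks on a full queue.
    results = dict(output_q.get() for _ in indices)
    for p in workers:
        p.join()
    return results             # processed data, ready to feed a training loop

if __name__ == "__main__":
    print(run([0, 1, 2])[1])
```

The request/output queue pair decouples scheduling from processing: the main process never touches the file, so the concurrent reads happen entirely in the sub-processes, which is the stated point of the claimed method.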
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911368393.0A CN111124685A (en) | 2019-12-26 | 2019-12-26 | Big data processing method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111124685A true CN111124685A (en) | 2020-05-08 |
Family
ID=70503185
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911368393.0A Pending CN111124685A (en) | 2019-12-26 | 2019-12-26 | Big data processing method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111124685A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050039049A1 (en) * | 2003-08-14 | 2005-02-17 | International Business Machines Corporation | Method and apparatus for a multiple concurrent writer file system |
CN103744643A (en) * | 2014-01-10 | 2014-04-23 | 浪潮(北京)电子信息产业有限公司 | Method and device for structuring a plurality of nodes parallel under multithreaded program |
CN104268229A (en) * | 2014-09-26 | 2015-01-07 | 北京金山安全软件有限公司 | Resource obtaining method and device based on multi-process browser |
CN105337755A (en) * | 2014-08-08 | 2016-02-17 | 阿里巴巴集团控股有限公司 | Master-slave architecture server, service processing method thereof and service processing system thereof |
CN109144741A (en) * | 2017-06-13 | 2019-01-04 | 广东神马搜索科技有限公司 | The method, apparatus and electronic equipment of interprocess communication |
CN109922319A (en) * | 2019-03-26 | 2019-06-21 | 重庆英卡电子有限公司 | RTSP agreement multiple video strems Parallel preconditioning method based on multi-core CPU |
CN110162452A (en) * | 2019-04-30 | 2019-08-23 | 广州微算互联信息技术有限公司 | A kind of analog detection method, device and storage medium for testing and control service |
CN110413386A (en) * | 2019-06-27 | 2019-11-05 | 深圳市富途网络科技有限公司 | Multiprocessing method, apparatus, terminal device and computer readable storage medium |
Non-Patent Citations (1)
Title |
---|
郭学兵 (Guo Xuebing): "Application of Python-based parallel programming technology in the batch database loading of standard meteorological reports" * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112346879A (en) * | 2020-11-06 | 2021-02-09 | 网易(杭州)网络有限公司 | Process management method and device, computer equipment and storage medium |
CN112346879B (en) * | 2020-11-06 | 2023-08-11 | 网易(杭州)网络有限公司 | Process management method, device, computer equipment and storage medium |
CN113326139A (en) * | 2021-06-28 | 2021-08-31 | 上海商汤科技开发有限公司 | Task processing method, device, equipment and storage medium |
CN113391909A (en) * | 2021-06-28 | 2021-09-14 | 上海商汤科技开发有限公司 | Process creation method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200293360A1 (en) | Techniques to manage virtual classes for statistical tests | |
Manolache et al. | Schedulability analysis of applications with stochastic task execution times | |
CN111258744A (en) | Task processing method based on heterogeneous computation and software and hardware framework system | |
CN111124685A (en) | Big data processing method and device, electronic equipment and storage medium | |
US20080209436A1 (en) | Automated testing of programs using race-detection and flipping | |
US20130061231A1 (en) | Configurable computing architecture | |
CN111190741A (en) | Scheduling method, device and storage medium based on deep learning node calculation | |
Gu et al. | Improving execution concurrency of large-scale matrix multiplication on distributed data-parallel platforms | |
Hong et al. | Hierarchical dataflow modeling of iterative applications | |
CN114924748A (en) | Compiling method, device and equipment | |
CN113407343A (en) | Service processing method, device and equipment based on resource allocation | |
CN116521350B (en) | ETL scheduling method and device based on deep learning algorithm | |
US11531565B2 (en) | Techniques to generate execution schedules from neural network computation graphs | |
KR20150101870A (en) | Method and apparatus for avoiding bank conflict in memory | |
Elshazly et al. | Accelerated execution via eager-release of dependencies in task-based workflows | |
CN113760497A (en) | Scheduling task configuration method and device | |
CN114127681A (en) | Method and apparatus for enabling autonomous acceleration of data flow AI applications | |
Batko et al. | Actor model of Anemone functional language | |
Delestrac et al. | Demystifying the TensorFlow Eager Execution of Deep Learning Inference on a CPU-GPU Tandem | |
Beach et al. | Integrating acceleration devices using CometCloud | |
McDonagh et al. | Applying semi-synchronised task farming to large-scale computer vision problems | |
US20230273818A1 (en) | Highly parallel processing architecture with out-of-order resolution | |
CN117742928B (en) | Algorithm component execution scheduling method for federal learning | |
WO2024109312A1 (en) | Task scheduling execution method, and generation method and apparatus for task scheduling execution instruction | |
Friebe et al. | Work-in-Progress: Validation of Probabilistic Timing Models of a Periodic Task with Interference-A Case Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200508 |