CN114519129A - Efficient sorting method and device for parallel machines of seismic big data clusters - Google Patents


Info

Publication number
CN114519129A
CN114519129A (application CN202210127121.7A)
Authority
CN
China
Prior art keywords
data
block
sorting
seismic
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210127121.7A
Other languages
Chinese (zh)
Inventor
刘雪飞
赵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyuan Xinghua Software Co ltd
Original Assignee
Beijing Yiyuan Xinghua Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyuan Xinghua Software Co ltd filed Critical Beijing Yiyuan Xinghua Software Co ltd
Priority to CN202210127121.7A
Publication of CN114519129A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to the field of seismic exploration, and in particular to a method and device for efficient sorting on a seismic big data cluster parallel machine. A data sorting task instruction triggers primary block-cutting of preset seismic data, and the resulting primary data blocks are sent to compute nodes. Each node cuts its primary data block a second time, according to the number of idle cores inside the node and the number of pre-allocated trace gathers, to form secondary data blocks, and starts multiple sorting threads. Each thread inputs data block by block, then screens and rearranges the data trace by trace and places it in a cache. Once a target data block has been screened, the same group of data is collected from the caches to form a trace-gather block, which is output to a designated position on the disk array. The distributed parallel data sorting model thus sorts large-scale seismic data efficiently, applies computing resources rationally, reduces the time cost of data processing, and effectively improves data sorting efficiency.

Description

Efficient sorting method and device for seismic big data cluster parallel machines
Technical Field
The invention relates to the field of seismic exploration, in particular to a method and a device for efficiently sorting a seismic big data cluster parallel machine.
Background
During seismic data processing, the data volume must be reordered according to different trace-header key values, which requires rearranging massive amounts of data.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
In view of the defects and shortcomings of the prior art, the invention aims to provide an efficient sorting method and device for a seismic big data cluster parallel machine. A data sorting task instruction is set, and the preset seismic data undergo primary block-cutting to form primary data blocks, which are sent to compute nodes. Each node cuts its primary data block a second time, according to the number of idle cores inside the node and the number of pre-allocated trace gathers, to form secondary data blocks, and starts multiple threads to perform the sorting operation. Each thread inputs data block by block, then screens and rearranges the data trace by trace and places it in a cache; after a target data block has been screened, the same group of data is acquired from the caches to form a trace gather, which is output to the designated position of the disk array. A distributed parallel data sorting model is thus realized that sorts large-scale seismic data efficiently, solves the read-write blocking problem of large-scale seismic data sorting through large-block input and output, applies computing resources rationally, reduces the time cost of data processing, and effectively improves sorting efficiency.
In order to realize the purpose, the invention adopts the technical scheme that:
an efficient sorting method for a seismic big data cluster parallel machine comprises the following steps: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
carrying out primary block cutting processing on the preset seismic data according to the data sorting task instruction to form a primary data block, and sending the primary data block to a computing node;
performing secondary segmentation on the primary data block according to the number of idle cores in the compute node and the number of pre-allocated trace gathers to form a secondary data block, and placing it in a cache;
and acquiring, according to the in-cache gather queues, the data of the same target keyword group from the caches to form a gather, and outputting the data block by block to the specified position of the disk array.
In some embodiments, the obtaining of the data sorting task instruction, the block-cutting processing of the preset seismic data to form a primary data block, and the sending of the primary data block to a computing node includes the following steps:
the method comprises the steps that a user presets the number N of sorted used nodes to obtain the computing resource condition of each node in a cluster, wherein FM is the percentage of the amount of idle memory, and FC is the number of idle computing cores; when FM is greater than 10% and FC is greater than 10%, selecting the current node as an available sorting node;
sequentially screening nodes in the cluster until the number of obtained idle nodes is N; if all nodes in the cluster are screened, and the number N1 of the obtained idle nodes is less than N, resetting the value of N to be N1;
performing data block cutting according to the total number of files stored in the target data, the index information and the selected node condition, wherein the primary data block formed after data block cutting comprises data file block cutting and sorting information block cutting;
and partitioning the data file into blocks and distributing the matched sorting information to a plurality of preset computing nodes for distributed computing.
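The node-screening steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name and the representation of a node as an (FM, FC) pair of percentages are assumptions; the patent gives FM as the free-memory percentage and FC as the idle-core count, each required to exceed 10%.

```python
def select_sort_nodes(cluster, n):
    """Screen cluster nodes in order until N idle nodes are found.

    cluster: iterable of (FM, FC) pairs, both expressed as percentages.
    Returns the selected nodes; if fewer than n qualify, the caller
    resets N to the number actually found (N1), as the text describes.
    """
    selected = []
    for node in cluster:
        fm, fc = node
        if fm > 10 and fc > 10:    # node has spare memory and spare cores
            selected.append(node)
        if len(selected) == n:     # stop once N idle nodes are found
            break
    return selected
```

If the whole cluster is exhausted first, the returned list is shorter than N, which corresponds to resetting N to N1.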
In some embodiments, forming the data-file blocks comprises the following steps:
the data file is cut into B blocks, where B = F/N;
F is the total number of files;
N is the number of idle nodes participating in the calculation;
when the calculated value of B is greater than N, the number of data-file blocks is B; when the calculated value of B is less than N, each data sub-file is divided into M parts, where
M = (N/F) + 1;
the total number of target files is then F1 = M × F;
and the data file is cut into B1 blocks, where B1 = M × F1/N.
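The block-count arithmetic above can be sketched as below. This is a hedged illustration: integer division and the function name are assumptions, and the formulas (B = F/N, M = (N/F) + 1, F1 = M × F, B1 = M × F1/N) are taken as stated in the text.

```python
def plan_file_blocks(f, n):
    """F files across N idle nodes: B = F/N primary blocks; when B does
    not exceed N, split each sub-file into M = (N/F) + 1 parts, giving
    F1 = M * F target files and B1 = M * F1 / N blocks."""
    b = f // n              # B = F / N
    if b > n:
        return b            # enough files: B data-file blocks
    m = n // f + 1          # M = (N / F) + 1
    f1 = m * f              # F1 = M x F
    return m * f1 // n      # B1 = M x F1 / N
```

For example, 1000 files on 10 nodes give 100 blocks directly, while 3 files on 10 nodes are first split into M = 4 parts each.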
In some embodiments, the sorting information block is formed by:
on the basis of data partitioning according to a data file, obtaining data information of a target keyword group gather existing in each group of data blocks through data indexing, wherein the sorting information partitioning structure is as follows:
{Block1, (key1, key2, ..., keyn), tcount},
meaning that the number of data traces in block Block1 that match the target keyword group (key1, key2, ..., keyn) is tcount;
and the sorting information blocks are respectively matched with the data file blocks.
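An illustrative construction of the sorting-information blocks described above: for each target keyword group found in a data block, record how many traces (tcount) match it. Modeling trace headers as dicts and the helper name are assumptions for this sketch, not part of the patent.

```python
from collections import Counter

def build_sort_info(block_name, traces, key_fields):
    """Count, per target keyword group, the traces in one data block.
    Each record mirrors the {Block1, (key1, key2, ..., keyn), tcount}
    structure of a sorting-information block."""
    counts = Counter(tuple(tr[k] for k in key_fields) for tr in traces)
    return [{"block": block_name, "keys": keys, "tcount": c}
            for keys, c in sorted(counts.items())]
```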
In some embodiments, performing secondary segmentation on the primary data block, according to the number of idle cores inside the compute node and the number of trace gathers allocated to the data block, to form secondary data blocks and place them in the cache comprises:
acquiring the number of idle cores and taking 50% of that number as the applicable thread count T, where T is at least 1;
letting G be the number of gathers in the data block;
each thread then processes E = G/T gathers;
when E is less than 1, the value of T is changed to T = G, ensuring that every thread has at least one gather to compute, i.e. E = 1;
after the data segmentation is complete, T threads are started on the compute node, E gathers are assigned to each thread, the data are rearranged, and the rearranged data are placed in sequence into the pre-allocated caches.
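The thread/gather allocation rule above can be sketched as follows (the function name is illustrative, and G is assumed to be at least 1):

```python
def plan_threads(idle_cores, g):
    """Allocate sorting threads on one compute node.
    T = 50% of the idle cores (at least 1); each thread gets E = G/T
    gathers; if E would fall below 1, set T = G so that E = 1."""
    t = max(1, idle_cores // 2)   # T is at least 1
    if g < t:                     # E = G/T would be < 1 ...
        t = g                     # ... so give each thread one gather
    e = g // t
    return t, e
```

For instance, 8 idle cores and 20 gathers yield 4 threads of 5 gathers each; 8 idle cores but only 2 gathers yield 2 threads of 1 gather each.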
In some embodiments, inputting the data block by block, rearranging the data, and placing the rearranged data in sequence into the pre-allocated cache comprises the steps of:
reading a target data block into memory, sorting the data according to the target keyword group, and placing each sorted trace, polling trace by trace, into the corresponding position of its trace cache; the cache is a set of in-memory queues created on the node from the sorting-information blocks, grouped into multiple queues by target keyword group.
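The per-node cache described above can be sketched as one in-memory queue per target keyword group, filled by polling the traces one by one. Traces are again modeled as header dicts; the names are illustrative assumptions.

```python
from collections import defaultdict, deque

def fill_cache(traces, key_fields):
    """Poll traces one by one and append each to the in-memory queue
    for its target keyword group, one queue per group."""
    cache = defaultdict(deque)
    for tr in traces:                              # poll trace by trace
        group = tuple(tr[k] for k in key_fields)   # target keyword group
        cache[group].append(tr)                    # enqueue in its queue
    return cache
```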
In some embodiments, acquiring the data of the same target keyword group from a plurality of caches according to the in-cache gather queues to form gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array comprises the steps of:
after the data of the same group have been collected from the caches, they are output to the disk array; the collection process polls all keyword groups, gathers the qualifying data traces from every cache into one trace gather, collects several gathers into an output data block, and writes the block in one operation to the designated position of the disk.
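The collection step above can be sketched as follows: poll every target keyword group, gather the matching traces from each cache into one trace gather, and bundle the gathers into a single output block for one large write (names and the dict-per-cache representation are illustrative assumptions).

```python
def collect_output_block(caches, key_groups):
    """Poll all target keyword groups and merge the matching traces
    from every cache into gathers, bundled into one output block."""
    output_block = []
    for group in key_groups:                  # poll all keyword groups
        gather = []
        for cache in caches:                  # one cache per thread/node
            gather.extend(cache.get(group, []))
        output_block.append((group, gather))  # one gather per group
    return output_block
```

Writing the bundled block in a single operation is what the text credits with avoiding read-write blocking on the disk array.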
The invention also discloses a high-efficiency sorting device of the parallel machine of the seismic big data cluster, which comprises,
the acquisition module is used for acquiring preset seismic data, data sorting task instructions, target keyword groups and the same group of data in the caches;
the cutting module is used for carrying out primary cutting processing on the seismic data according to a preset requirement to form a primary data block and carrying out secondary cutting processing to form a secondary data block;
the calculation module is used for calculating the primary dicing processing and the secondary dicing processing according to a preset requirement;
a memory for storing data information and computer program instructions;
a processor that when executing the computer program instructions implements: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
carrying out primary block cutting processing on the preset seismic data according to the data sorting task instruction to form a primary data block, and sending the primary data block to a computing node;
performing secondary segmentation on the first-level data block according to the number of idle cores in the computing node and the number of pre-allocated tracks to form a second-level data block and placing the second-level data block in a cache;
and acquiring the same group of data in the caches to form a gather, and outputting the gather to the disk array respectively.
Advantageous effects
The invention provides an efficient sorting method and device for a seismic big data cluster parallel machine. Each thread inputs data block by block, then screens and rearranges the data trace by trace and places it in a cache; after a target data block has been screened, the same group of data is acquired from the caches to form a trace gather, which is output to the designated position of the disk array. Large-scale seismic data are thus sorted efficiently through a distributed parallel data sorting model, the read-write blocking problem of large-scale seismic data sorting is solved through large-block input and output, computing resources are applied rationally, the time cost of data processing is reduced, and sorting efficiency is effectively improved.
Drawings
FIG. 1 is a flow chart of efficient sorting of seismic big data cluster parallel machines according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an efficient sorting structure of a seismic big data cluster parallel machine according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a thread sorting process according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a buffer queue structure according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a parallel machine efficient sorting device for seismic big data clusters, provided by an embodiment of the present invention;
FIG. 6 is a time comparison graph of the sorting algorithm disclosed in the present invention versus the prior art sorting algorithm for the seismic big data processing process.
Wherein the reference numbers indicate:
an acquisition module 1; a cutting module 2; a calculation module 3; a memory 4; a processor 5.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings; the described embodiments are only some, not all, of the embodiments of the invention. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion and should be interpreted to mean "including, but not limited to". The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying the number of technical features indicated; "plurality" means two or more. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 5, the invention provides an efficient sorting device for a seismic big data cluster parallel machine, which comprises an acquisition module 1, a cutting module 2, a calculation module 3, a storage 4 and a processor 5.
Wherein,
the acquisition module 1 is used for acquiring preset seismic data, a data sorting task instruction, a target keyword group and the same group of data in the caches.
And the cutting module 2 is used for carrying out primary block cutting processing on the seismic data according to a preset requirement to form a primary data block and carrying out secondary block cutting processing to form a secondary data block.
And the calculating module 3 is used for calculating the primary dicing processing and the secondary dicing processing according to the preset requirements.
The memory 4 is used for storing data information and computer program instructions.
A processor 5, which when executing the computer program instructions, implements: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
the preset seismic data are cut into blocks according to the data sorting task instruction to form primary data blocks, and the primary data blocks are sent to a computing node;
performing secondary segmentation on the primary data block according to the number of idle cores in the compute node and the number of pre-allocated trace gathers to form a secondary data block, and placing it in a cache;
and acquiring, according to the in-cache gather queues, the data of the same target keyword group from the caches to form trace gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array.
As shown in fig. 1-5, the present invention adopts the following technical solutions:
an efficient sorting method for a seismic big data cluster parallel machine comprises the following steps:
the acquisition module 1 acquires preset seismic data and a data sorting task instruction, and the processor 5 issues the data sorting task instruction according to the requirement;
after the calculation module 3 processes the preset seismic data according to the data sorting task instruction, the cutting module 2 performs primary block-cutting to form primary data blocks, which are sent to the compute nodes and stored in the memory 4;
after the calculation module 3 has evaluated the primary data block according to the number of idle cores inside the compute node and the number of pre-allocated trace gathers, the cutting module 2 performs secondary segmentation to form secondary data blocks, which are stored in the memory 4; several sorting threads are started, data are read into memory block by block, screened and rearranged trace by trace, and placed into the corresponding gather queues in the cache;
the acquisition module 1 acquires the same target keyword group data in a plurality of caches to form a gather, collects a plurality of gathers to form an output data block, and outputs the data to the designated position of the disk array according to the block.
The preferred embodiment of the present invention is shown in fig. 1-5:
an efficient sorting method for a seismic big data cluster parallel machine comprises the following steps: the acquisition module 1 acquires preset seismic data and a data sorting task instruction, stores the preset seismic data and the data sorting task instruction in the storage 4, and the processor 5 issues the data sorting task instruction according to requirements;
the calculation module 3 calculates and processes the preset seismic data according to data tasks and cluster configuration conditions, the cutting module 2 performs one-time block cutting processing to form a primary data block, the primary data block formed after data block cutting comprises data file blocks and sorting information blocks, the data file blocks and the matching sorting information blocks are distributed to a plurality of preset calculation nodes, and distributed calculation is performed.
The method comprises the following steps that an acquisition module 1 acquires a data sorting task instruction, a cutting module 2 cuts preset seismic data into blocks to form primary data blocks, and the primary data blocks are sent to a computing node, and comprises the following steps:
the method comprises the steps that a user presets N, an acquisition module 1 acquires calculation resources of all nodes in a cluster, FM is the percentage of the amount of idle memory, and FC is the number of idle calculation cores; when FM > 10% and FC > 10%, processor 5 selects the current node as an available sort node.
Sequentially screening nodes in the cluster until the number of obtained idle nodes is N; if all the nodes in the cluster are screened, and the calculating module 3 calculates the number of the obtained idle nodes N1< N, the value of N is reset to be N1.
The cutting module 2 performs data block-cutting according to the total number of files stored in the target data, the index information and the selected node condition; the primary data block formed after block-cutting comprises data-file blocks and sorting-information blocks.
And partitioning the data file into blocks and distributing the matched sorting information to a plurality of preset computing nodes for distributed computing.
In some embodiments, the forming of the data file chunks comprises the steps of:
the data file is cut into B blocks, where B = F/N;
F is the total number of files, and N is the number of idle nodes participating in the calculation;
when the value of B calculated by the calculation module 3 is greater than N, the number of data-file blocks is B; when the calculated value of B is less than N, each data sub-file is divided into M parts, where M = (N/F) + 1;
the total number of target files is then F1 = M × F;
and the data file is cut into B1 blocks, where B1 = M × F1/N.
In some embodiments, the sorting information block is formed by:
obtaining data information of a target keyword group gather existing in each group of data blocks through data indexing, wherein the sorting information blocks have the following structure:
{Block1, (key1, key2, ..., keyn), tcount}
meaning that the number of data traces in block Block1 that match the target keyword group (key1, key2, ..., keyn) is tcount;
and the sorting information blocks are respectively matched with the data file blocks.
In some embodiments, performing secondary segmentation on the primary data block, according to the number of idle cores inside the compute node and the number of trace gathers allocated to the data block, to form secondary data blocks and place them in the cache comprises:
acquiring the number of idle cores and taking 50% of that number as the applicable thread count T, where T is at least 1;
letting G be the number of gathers in the data block;
each thread then processes E = G/T gathers;
when E is less than 1, the value of T is changed to T = G, ensuring that every thread has at least one gather to compute, i.e. E = 1;
after the data segmentation is complete, T threads are started on the compute node and E gathers are assigned to each thread;
the data are rearranged and the rearranged data are placed in sequence into the pre-allocated caches, as shown in fig. 3.
In some embodiments, the step of each thread inputting data into the node block by block, rearranging the data trace by trace, and placing the rearranged data in sequence into the pre-allocated cache comprises:
the processor 5 inputs data into the node block by block, sorts the input data according to the target keyword group, and places each sorted trace, polling trace by trace, into the corresponding position of its trace cache queue; the cache is a set of in-memory queues created on the node from the sorting-information blocks and grouped into multiple queues by target keyword group. As shown in fig. 4, the data are grouped into queues according to the value of the target keyword group, tcount is the total number of traces of each keyword group in the current node's task, and ltr is the number of bytes occupied by each trace. A cache of tcount × ltr bytes is created for each target keyword group to store the sorting data of the current node.
Target data are read into memory in whole blocks in the default order, which improves the read-write utilization of the disk array.
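The cache-sizing rule above, a buffer of tcount × ltr bytes per target keyword group, can be sketched as follows. The record layout (dicts with 'keys' and 'tcount') is an illustrative assumption.

```python
def cache_sizes(sort_info, ltr):
    """Per-group cache size in bytes: tcount traces of ltr bytes each.
    sort_info: records with 'keys' and 'tcount' fields, as carried by
    the sorting-information blocks; ltr: bytes occupied by one trace."""
    return {rec["keys"]: rec["tcount"] * ltr for rec in sort_info}
```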
In some embodiments, acquiring the data of the same target keyword group from a plurality of caches according to the in-cache gather queues to form gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array comprises the steps of:
after the data of the same keyword group have been collected from the caches, they are output block by block to the designated position of the disk array; the collection process polls all target keyword groups (key1, key2, ..., keyn), gathers the qualifying data traces from every cache into one trace gather, collects several gathers into an output data block, and writes the block in one operation to the designated position of the disk.
Advantageous effects:
As shown in fig. 6, a 5 TB data set was used for a sorting test against conventional commercial software F. As the number of nodes increases, the advantage of the sorting algorithm disclosed in this patent becomes increasingly apparent, and cluster resources are used more effectively. The distributed parallel data sorting model disclosed in this patent supports efficient sorting of large-scale seismic data, applies computing resources rationally, improves sorting efficiency, and reduces the time cost of data processing; as parallelism increases, the computation time falls to roughly 50% of that of comparable software.

Claims (8)

1. An efficient sorting method for a parallel machine of a seismic big data cluster is characterized by comprising the following steps:
acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
the preset seismic data are cut into blocks according to the data sorting task instruction to form primary data blocks, and the primary data blocks are sent to the computing nodes;
performing secondary segmentation on the primary data block according to the number of idle cores in the compute node and the number of pre-allocated trace gathers to form a secondary data block, and placing it in a cache;
and acquiring, according to the in-cache gather queues, the data of the same target keyword group from the caches to form trace gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array.
2. The efficient sorting method of the seismic big data cluster parallel machines according to claim 1,
the method comprises the following steps of obtaining a data sorting task instruction, carrying out block cutting processing on preset seismic data to form a primary data block, and sending the primary data block to a computing node, wherein the method comprises the following steps:
the method comprises the steps that a user presets the number N of sorted used nodes to obtain the computing resource condition of each node in a cluster, wherein FM is the percentage of the amount of idle memory, and FC is the number of idle computing cores; when FM is greater than 10% and FC is greater than 10%, selecting the current node as an available sorting node;
sequentially screening nodes in the cluster until the number of obtained idle nodes is N; if all nodes in the cluster are screened, and the number N1 of the obtained idle nodes is less than N, resetting the value of N to be N1;
performing data block cutting according to the total number of files stored in the target data, the index information and the selected node condition, wherein the primary data block formed after data block cutting comprises data file block cutting and sorting information block cutting;
and partitioning the data file into blocks and distributing the matched sorting information to a plurality of preset computing nodes for distributed computing.
3. The efficient sorting method for seismic big data cluster parallel machines according to claim 2, wherein
the data file blocks are formed by the following calculation steps:
the number of data file blocks is B, where B = F/N,
F is the total number of files,
and N is the number of idle nodes participating in the calculation;
when the calculated value of B is greater than N, the number of data file blocks is B; when the calculated value of B is less than N, each data subfile is divided into M parts, where
M = (N/F) + 1;
the total number of target files then becomes F1 = M × F,
and the number of data file blocks is B1 = F1/N.
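The block-count arithmetic of claim 3 can be followed step by step in a small sketch. Integer division is an assumption on my part; the claim does not state rounding behaviour, and the function name is illustrative.

```python
# Hedged sketch of claim 3: B = F/N; if B < N, each subfile is split
# into M = (N/F) + 1 parts, giving F1 = M * F files and B1 = F1/N
# blocks. Integer (floor) division is assumed throughout.
def block_count(f, n):
    """f: total number of files; n: idle nodes in the calculation."""
    b = f // n               # B = F / N
    if b > n:                # enough blocks already: use B directly
        return b
    m = n // f + 1           # M = (N / F) + 1
    f1 = m * f               # F1 = M * F  (new total file count)
    return f1 // n           # B1 = F1 / N
```

For example, 120 files on 10 nodes give B = 12 > 10, so B is used directly; 4 files on 10 nodes trigger the subfile split with M = 3 and F1 = 12.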
4. The efficient sorting method for seismic big data cluster parallel machines according to claim 3, wherein the sorting information blocks are formed by the following steps:
on the basis of the partitioning by data file, the information on the target key group gathers present in each group of data blocks is obtained through the data index, and the structure of a sorting information block is:
{Block1, (key1, key2, ..., keyn), tcount}
meaning that the number of data traces in data block Block1 matching the target key group (key1, key2, ..., keyn) is tcount;
and the sorting information blocks are matched one-to-one with the data file blocks.
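The {Block1, (key1, ..., keyn), tcount} record of claim 4 maps naturally onto a small immutable structure. The class and field names below are illustrative, not taken from the patent.

```python
# Minimal sketch of the sorting-information block in claim 4: it records
# that tcount traces in data block block_id match the target key group.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SortInfoBlock:
    block_id: str                  # e.g. "Block1"
    key_group: Tuple[str, ...]     # target key group (key1, ..., keyn)
    tcount: int                    # matching trace count in the block

# Example: 128 traces in Block1 match the (shot, offset) key group.
info = SortInfoBlock("Block1", ("shot", "offset"), 128)
```

One such record per data file block gives the one-to-one matching the claim describes.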
5. The efficient sorting method for seismic big data cluster parallel machines according to claim 1, wherein the step of performing secondary segmentation on the primary data block according to the number of idle cores within the computing node and the number of trace gathers allocated to the data block, to form secondary data blocks and place them in the cache, comprises the following steps:
the number of idle cores is obtained, and 50% of the number of idle cores is taken as the applicable thread count T, where T is at least 1;
the number of trace gathers in the data block is G;
the number of trace gathers computed by each thread is E = G/T;
when E is less than 1, the value of T is changed so that T = G, ensuring that each thread has at least one computable trace gather, i.e. E = 1;
after the data segmentation is completed, T threads are started on the computing node, E trace gathers are allocated to each thread, the data are rearranged, and after rearrangement the data are placed in sequence into the pre-allocated caches.
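The thread and gather arithmetic of claim 5 can be sketched as follows. Integer division and the function name are assumptions; the claim states only the ratios.

```python
# Sketch of claim 5: T is 50% of the idle cores (at least 1); each
# thread handles E = G/T trace gathers; if E < 1 (fewer gathers than
# threads), T is reduced to G so that every thread has exactly one.
def plan_threads(idle_cores, gather_count):
    t = max(1, idle_cores // 2)   # T = 50% of idle cores, T >= 1
    e = gather_count // t         # E = G / T
    if e < 1:                     # fewer gathers than threads
        t = gather_count          # set T = G ...
        e = 1                     # ... so that E = 1
    return t, e
```

With 8 idle cores and 20 gathers this yields 4 threads of 5 gathers each; with 8 cores but only 2 gathers, T collapses to 2 so no thread sits idle without a gather.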
6. The efficient sorting method for seismic big data cluster parallel machines according to claim 5, wherein the step of rearranging the data according to the input data blocks and placing the rearranged data in sequence into the pre-allocated caches comprises the following steps:
the target data are read into memory block by block and sorted according to the target key group, and each trace of the sorted data is placed, by polling trace by trace, into the position of the trace cache corresponding to its trace gather; the cache is a set of memory-based queues created on the node from the sorting information blocks, the queues being grouped into a plurality of queues according to the target key group.
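The rearrangement step above can be sketched as sorting trace headers by the target key group and distributing them into per-key-group queues. Field names such as "shot" are illustrative; the patent does not specify a header layout.

```python
# Hedged sketch of the rearrangement in claim 6: traces are sorted by
# the target key group, then polled trace by trace into in-memory
# queues (the node's cache), one queue per key-group value.
from collections import defaultdict

def rearrange(traces, key_fields):
    """traces: list of trace-header dicts; key_fields: target key group."""
    key_of = lambda tr: tuple(tr[f] for f in key_fields)
    ordered = sorted(traces, key=key_of)   # sort by the target key group
    queues = defaultdict(list)             # one queue per key-group value
    for tr in ordered:                     # poll trace by trace
        queues[key_of(tr)].append(tr)
    return queues

traces = [{"shot": 2, "trace": "b"}, {"shot": 1, "trace": "a"},
          {"shot": 1, "trace": "c"}]
queues = rearrange(traces, ("shot",))      # two queues: shot 1 and shot 2
```

Python's sort is stable, so traces sharing a key-group value keep their original relative order inside each queue.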
7. The efficient sorting method for seismic big data cluster parallel machines according to claim 1, wherein the step of acquiring data with the same target key group from the plurality of caches according to the in-cache gather queues to form trace gathers, assembling a plurality of trace gathers into an output data block, and writing the data block by block to the designated location in the disk array comprises the following steps:
the data of the same group are collected from the plurality of caches and then output to the disk array; the collection process polls all key groups, gathers the matching data traces from each cache into a trace-gather block, and then writes the block in a single pass to the designated location in the disk array.
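The collection step of claim 7 amounts to polling every key group across the node caches and concatenating the matching traces into one output block per group. The sketch below omits the actual disk-array write and uses illustrative names.

```python
# Sketch of claim 7's collection pass: for each target key group, the
# matching traces from every node cache are concatenated into a single
# trace-gather block, ready for one unified write to the disk array.
def collect_output_blocks(caches, key_groups):
    """caches: list of {key_group: [trace, ...]} dicts, one per node."""
    blocks = []
    for key in key_groups:               # poll all key groups
        gather = []
        for cache in caches:             # gather matching traces per cache
            gather.extend(cache.get(key, []))
        blocks.append((key, gather))     # one output block per key group
    return blocks

caches = [{(1,): ["t1"]}, {(1,): ["t2"], (2,): ["t3"]}]
blocks = collect_output_blocks(caches, [(1,), (2,)])
```

Traces "t1" and "t2" for key group (1,) come from different caches but end up in one block, matching the claim's "collect together, then output uniformly" behaviour.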
8. An efficient sorting device for seismic big data cluster parallel machines, characterized by comprising:
an acquisition module, configured to acquire the preset seismic data, the data sorting task instruction, the target key group, and the same group of data in the plurality of caches;
a cutting module, configured to perform primary block-cutting processing on the seismic data according to a preset requirement to form primary data blocks, and to perform secondary block-cutting processing to form secondary data blocks;
a calculation module, configured to perform the calculations for the primary block-cutting processing and the secondary block-cutting processing according to a preset requirement;
a memory, configured to store data information and computer program instructions;
and a processor which, when executing the computer program instructions, implements: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
performing primary block-cutting processing on the preset seismic data according to the data sorting task instruction to form primary data blocks, and sending the primary data blocks to the computing nodes;
performing secondary segmentation on the primary data block according to the number of idle cores within the computing node and the number of pre-allocated trace gathers, to form secondary data blocks and place them in a cache;
and acquiring the same group of data from the plurality of caches to form trace gathers, and outputting the trace gathers respectively to the disk array.
CN202210127121.7A 2022-02-11 2022-02-11 Efficient sorting method and device for parallel machines of seismic big data clusters Pending CN114519129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210127121.7A CN114519129A (en) 2022-02-11 2022-02-11 Efficient sorting method and device for parallel machines of seismic big data clusters


Publications (1)

Publication Number Publication Date
CN114519129A true CN114519129A (en) 2022-05-20

Family

ID=81595961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210127121.7A Pending CN114519129A (en) 2022-02-11 2022-02-11 Efficient sorting method and device for parallel machines of seismic big data clusters

Country Status (1)

Country Link
CN (1) CN114519129A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104459781A (en) * 2014-12-09 2015-03-25 中国石油天然气集团公司 Three-dimensional pre-stack seismic data random noise degeneration method
CN104570063A (en) * 2015-02-11 2015-04-29 安徽吉拓电子技术有限公司 Parallel extraction method of seismic channel set of seismic data
CN106250101A (en) * 2015-06-12 2016-12-21 中国石油化工股份有限公司 Migration before stack method for parallel processing based on MapReduce and device
CN109657197A (en) * 2017-10-10 2019-04-19 中国石油化工股份有限公司 A kind of pre-stack depth migration calculation method and system
CN110187383A (en) * 2019-05-27 2019-08-30 中海石油(中国)有限公司 A kind of quick method for separating of sea wide-azimuth seismic data COV trace gather
WO2020041928A1 (en) * 2018-08-27 2020-03-05 深圳市锐明技术股份有限公司 Data storage method and system and terminal device
CN111767264A (en) * 2019-04-02 2020-10-13 中国石油化工股份有限公司 Distributed storage method and data reading method based on geological information coding


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU LANFENG et al.: "Database-based parallel trace-gather extraction technique", Oil Geophysical Prospecting (石油地球物理勘探), vol. 2015, no. 03, 15 June 2015 (2015-06-15) *
WEN BILONG et al.: "Efficiency-optimized design for distributed storage and access of seismic data", Computer & Digital Engineering (计算机与数字工程), vol. 2014, no. 08, 20 August 2014 (2014-08-20) *

Similar Documents

Publication Publication Date Title
CN110019396B (en) Data analysis system and method based on distributed multidimensional analysis
CN106681846B (en) Statistical method, device and system of log data
CN107122126B (en) Data migration method, device and system
CN106790529B (en) Dispatching method, control centre and the scheduling system of computing resource
CN101459901B (en) Vector map data transmission method based on multi-stage slicing mode
US8756309B2 (en) Resource information collecting device, resource information collecting method, program, and collection schedule generating device
US8321476B2 (en) Method and system for determining boundary values dynamically defining key value bounds of two or more disjoint subsets of sort run-based parallel processing of data from databases
WO2015024474A1 (en) Rapid calculation method for electric power reliability index based on multithread processing of cache data
CN111259933B (en) High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN113568813B (en) Mass network performance data acquisition method, device and system
CN114756629A (en) Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN114519129A (en) Efficient sorting method and device for parallel machines of seismic big data clusters
CN114238360A (en) User behavior analysis system
CN107679133B (en) Mining method applicable to massive real-time PMU data
CN111078790B (en) Method and system for synchronizing isolated block data in block chain and storage medium
CN102521413A (en) Data reading device based on network reports and method
CN110716986B (en) Big data analysis system and application method thereof
CN112988736B (en) Mass data quality checking method and system
CN112069168B (en) Cloud storage method for equipment operation data
CN111737347B (en) Method and device for sequentially segmenting data on Spark platform
CN114416785A (en) Stream type enterprise big data processing method and storage medium
CN112148929A (en) Big data analysis method and device based on tree network
CN107784032A (en) Gradual output intent, the apparatus and system of a kind of data query result
CN113282568A (en) IOT big data real-time sequence flow analysis application technical method
JP2013101539A (en) Sampling device, sampling program, and method therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination