CN114519129A - Efficient sorting method and device for parallel machines of seismic big data clusters - Google Patents


Info

Publication number
CN114519129A
CN114519129A (application CN202210127121.7A)
Authority
CN
China
Prior art keywords
data
block
sorting
seismic
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210127121.7A
Other languages
Chinese (zh)
Inventor
刘雪飞
赵伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yiyuan Xinghua Software Co ltd
Original Assignee
Beijing Yiyuan Xinghua Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yiyuan Xinghua Software Co ltd filed Critical Beijing Yiyuan Xinghua Software Co ltd
Priority to CN202210127121.7A
Publication of CN114519129A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/9035 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to the field of seismic exploration, and in particular to a method and device for efficient sorting on a seismic big data cluster parallel machine. A data sorting task instruction triggers primary block-cutting of preset seismic data, and the resulting primary data blocks are sent to compute nodes. Each node cuts its primary data block a second time, according to the number of idle cores inside the node and the number of pre-allocated trace gathers, to form secondary data blocks, and starts multiple sorting threads. Each thread inputs data block by block, then screens and rearranges the data trace by trace and places it in a cache. Once a target data block has been screened, the same group of data is collected from the caches to form a trace-gather block, which is output to a designated position on the disk array. The distributed parallel data sorting model thus sorts large-scale seismic data efficiently, applies computing resources rationally, reduces the time cost of data processing, and effectively improves data sorting efficiency.

Description

Efficient sorting method and device for seismic big data cluster parallel machines
Technical Field
The invention relates to the field of seismic exploration, in particular to a method and a device for efficiently sorting a seismic big data cluster parallel machine.
Background
During seismic data processing, the data volume must be reordered according to different trace-header key values, which requires rearranging massive amounts of data.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
In view of the defects and shortcomings of the prior art, the invention aims to provide an efficient sorting method and device for a seismic big data cluster parallel machine. A data sorting task instruction is set, and the preset seismic data undergo primary block-cutting to form primary data blocks, which are sent to compute nodes. Each node cuts its primary data block a second time, according to the number of idle cores inside the node and the number of pre-allocated trace gathers, to form secondary data blocks, and starts multiple threads to perform the sorting operation. Each thread inputs data block by block, then screens and rearranges the data trace by trace and places it in a cache; after a target data block has been screened, the same group of data is acquired from the caches to form a trace gather, which is output to the designated position of the disk array. A distributed parallel data sorting model is thus realized that sorts large-scale seismic data efficiently, solves the read-write blocking problem of large-scale seismic data sorting through large-block input and output, applies computing resources rationally, reduces the time cost of data processing, and effectively improves sorting efficiency.
In order to realize the purpose, the invention adopts the technical scheme that:
an efficient sorting method for a seismic big data cluster parallel machine comprises the following steps: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
carrying out primary block cutting processing on the preset seismic data according to the data sorting task instruction to form a primary data block, and sending the primary data block to a computing node;
performing secondary segmentation on the primary data block according to the number of idle cores in the compute node and the number of pre-allocated trace gathers to form a secondary data block, and placing it in a cache;
and acquiring, according to the in-cache gather queues, the data of the same target keyword group from the caches to form a gather, and outputting the data block by block to the specified position of the disk array.
In some embodiments, the obtaining of the data sorting task instruction, the block-cutting processing of the preset seismic data to form a primary data block, and the sending of the primary data block to a computing node includes the following steps:
the method comprises the steps that a user presets the number N of sorted used nodes to obtain the computing resource condition of each node in a cluster, wherein FM is the percentage of the amount of idle memory, and FC is the number of idle computing cores; when FM is greater than 10% and FC is greater than 10%, selecting the current node as an available sorting node;
sequentially screening nodes in the cluster until the number of obtained idle nodes is N; if all nodes in the cluster are screened, and the number N1 of the obtained idle nodes is less than N, resetting the value of N to be N1;
performing data block cutting according to the total number of files stored in the target data, the index information and the selected node condition, wherein the primary data block formed after data block cutting comprises data file block cutting and sorting information block cutting;
and partitioning the data file into blocks and distributing the matched sorting information to a plurality of preset computing nodes for distributed computing.
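The node-screening steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name and the representation of a node as an (FM, FC) pair of percentages are assumptions; the patent gives FM as the free-memory percentage and FC as the idle-core count, each required to exceed 10%.

```python
def select_sort_nodes(cluster, n):
    """Screen cluster nodes in order until N idle nodes are found.

    cluster: iterable of (FM, FC) pairs, both expressed as percentages.
    Returns the selected nodes; if fewer than n qualify, the caller
    resets N to the number actually found (N1), as the text describes.
    """
    selected = []
    for node in cluster:
        fm, fc = node
        if fm > 10 and fc > 10:    # node has spare memory and spare cores
            selected.append(node)
        if len(selected) == n:     # stop once N idle nodes are found
            break
    return selected
```

If the whole cluster is exhausted first, the returned list is shorter than N, which corresponds to resetting N to N1.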
In some embodiments, forming the data-file blocks comprises the following steps:
the data file is cut into B blocks, where B = F/N;
F is the total number of files;
N is the number of idle nodes participating in the calculation;
when the calculated value of B is greater than N, the number of data-file blocks is B; when the calculated value of B is less than N, each data sub-file is divided into M parts, where
M = (N/F) + 1;
the total number of target files is then F1 = M × F;
and the data file is cut into B1 blocks, where B1 = M × F1/N.
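The block-count arithmetic above can be sketched as below. This is a hedged illustration: integer division and the function name are assumptions, and the formulas (B = F/N, M = (N/F) + 1, F1 = M × F, B1 = M × F1/N) are taken as stated in the text.

```python
def plan_file_blocks(f, n):
    """F files across N idle nodes: B = F/N primary blocks; when B does
    not exceed N, split each sub-file into M = (N/F) + 1 parts, giving
    F1 = M * F target files and B1 = M * F1 / N blocks."""
    b = f // n              # B = F / N
    if b > n:
        return b            # enough files: B data-file blocks
    m = n // f + 1          # M = (N / F) + 1
    f1 = m * f              # F1 = M x F
    return m * f1 // n      # B1 = M x F1 / N
```

For example, 1000 files on 10 nodes give 100 blocks directly, while 3 files on 10 nodes are first split into M = 4 parts each.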
In some embodiments, the sorting information block is formed by:
on the basis of data partitioning according to a data file, obtaining data information of a target keyword group gather existing in each group of data blocks through data indexing, wherein the sorting information partitioning structure is as follows:
{Block1, (key1, key2, ..., keyn), tcount},
meaning that the number of data traces in block Block1 that match the target keyword group (key1, key2, ..., keyn) is tcount;
and the sorting information blocks are respectively matched with the data file blocks.
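An illustrative construction of the sorting-information blocks described above: for each target keyword group found in a data block, record how many traces (tcount) match it. Modeling trace headers as dicts and the helper name are assumptions for this sketch, not part of the patent.

```python
from collections import Counter

def build_sort_info(block_name, traces, key_fields):
    """Count, per target keyword group, the traces in one data block.
    Each record mirrors the {Block1, (key1, key2, ..., keyn), tcount}
    structure of a sorting-information block."""
    counts = Counter(tuple(tr[k] for k in key_fields) for tr in traces)
    return [{"block": block_name, "keys": keys, "tcount": c}
            for keys, c in sorted(counts.items())]
```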
In some embodiments, performing secondary segmentation on the primary data block, according to the number of idle cores inside the compute node and the number of trace gathers allocated to the data block, to form secondary data blocks and place them in the cache comprises:
acquiring the number of idle cores and taking 50% of that number as the applicable thread count T, where T is at least 1;
letting G be the number of gathers in the data block;
each thread then processes E = G/T gathers;
when E is less than 1, the value of T is changed to T = G, ensuring that every thread has at least one gather to compute, i.e. E = 1;
after the data segmentation is complete, T threads are started on the compute node, E gathers are assigned to each thread, the data are rearranged, and the rearranged data are placed in sequence into the pre-allocated caches.
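The thread/gather allocation rule above can be sketched as follows (the function name is illustrative, and G is assumed to be at least 1):

```python
def plan_threads(idle_cores, g):
    """Allocate sorting threads on one compute node.
    T = 50% of the idle cores (at least 1); each thread gets E = G/T
    gathers; if E would fall below 1, set T = G so that E = 1."""
    t = max(1, idle_cores // 2)   # T is at least 1
    if g < t:                     # E = G/T would be < 1 ...
        t = g                     # ... so give each thread one gather
    e = g // t
    return t, e
```

For instance, 8 idle cores and 20 gathers yield 4 threads of 5 gathers each; 8 idle cores but only 2 gathers yield 2 threads of 1 gather each.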
In some embodiments, inputting the data block by block, rearranging the data, and placing the rearranged data in sequence into the pre-allocated cache comprises the steps of:
reading a target data block into memory, sorting the data according to the target keyword group, and placing each sorted trace, polling trace by trace, into the corresponding position of its trace cache; the cache is a set of in-memory queues created on the node from the sorting-information blocks, grouped into multiple queues by target keyword group.
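The per-node cache described above can be sketched as one in-memory queue per target keyword group, filled by polling the traces one by one. Traces are again modeled as header dicts; the names are illustrative assumptions.

```python
from collections import defaultdict, deque

def fill_cache(traces, key_fields):
    """Poll traces one by one and append each to the in-memory queue
    for its target keyword group, one queue per group."""
    cache = defaultdict(deque)
    for tr in traces:                              # poll trace by trace
        group = tuple(tr[k] for k in key_fields)   # target keyword group
        cache[group].append(tr)                    # enqueue in its queue
    return cache
```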
In some embodiments, acquiring the data of the same target keyword group from a plurality of caches according to the in-cache gather queues to form gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array comprises the steps of:
after the data of the same group have been collected from the caches, they are output to the disk array; the collection process polls all keyword groups, gathers the qualifying data traces from every cache into one trace gather, collects several gathers into an output data block, and writes the block in one operation to the designated position of the disk.
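The collection step above can be sketched as follows: poll every target keyword group, gather the matching traces from each cache into one trace gather, and bundle the gathers into a single output block for one large write (names and the dict-per-cache representation are illustrative assumptions).

```python
def collect_output_block(caches, key_groups):
    """Poll all target keyword groups and merge the matching traces
    from every cache into gathers, bundled into one output block."""
    output_block = []
    for group in key_groups:                  # poll all keyword groups
        gather = []
        for cache in caches:                  # one cache per thread/node
            gather.extend(cache.get(group, []))
        output_block.append((group, gather))  # one gather per group
    return output_block
```

Writing the bundled block in a single operation is what the text credits with avoiding read-write blocking on the disk array.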
The invention also discloses a high-efficiency sorting device of the parallel machine of the seismic big data cluster, which comprises,
the acquisition module is used for acquiring preset seismic data, data sorting task instructions, target keyword groups and the same group of data in the caches;
the cutting module is used for carrying out primary cutting processing on the seismic data according to a preset requirement to form a primary data block and carrying out secondary cutting processing to form a secondary data block;
the calculation module is used for calculating the primary dicing processing and the secondary dicing processing according to a preset requirement;
a memory for storing data information and computer program instructions;
a processor that when executing the computer program instructions implements: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
carrying out primary block cutting processing on the preset seismic data according to the data sorting task instruction to form a primary data block, and sending the primary data block to a computing node;
performing secondary segmentation on the first-level data block according to the number of idle cores in the computing node and the number of pre-allocated tracks to form a second-level data block and placing the second-level data block in a cache;
and acquiring the same group of data in the caches to form a gather, and outputting the gather to the disk array respectively.
Advantageous effects
The invention provides an efficient sorting method and device for a seismic big data cluster parallel machine. Each thread inputs data block by block, then screens and rearranges the data trace by trace and places it in a cache; after a target data block has been screened, the same group of data is acquired from the caches to form a trace gather, which is output to the designated position of the disk array. Large-scale seismic data are thus sorted efficiently through a distributed parallel data sorting model, the read-write blocking problem of large-scale seismic data sorting is solved through large-block input and output, computing resources are applied rationally, the time cost of data processing is reduced, and sorting efficiency is effectively improved.
Drawings
FIG. 1 is a flow chart of efficient sorting of seismic big data cluster parallel machines according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an efficient sorting structure of a seismic big data cluster parallel machine according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a thread sorting process according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a buffer queue structure according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a parallel machine efficient sorting device for seismic big data clusters, provided by an embodiment of the present invention;
FIG. 6 is a time comparison graph of the sorting algorithm disclosed in the present invention versus the prior art sorting algorithm for the seismic big data processing process.
Wherein the reference numbers indicate:
an acquisition module 1; a cutting module 2; a calculation module 3; a memory 4; a processor 5.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings; the described embodiments are only some, not all, of the embodiments of the invention. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion and should be interpreted to mean "including, but not limited to". The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying the number of technical features indicated; "plurality" means two or more. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 5, the invention provides an efficient sorting device for a seismic big data cluster parallel machine, which comprises an acquisition module 1, a cutting module 2, a calculation module 3, a storage 4 and a processor 5.
Wherein,
the acquisition module 1 is used for acquiring preset seismic data, a data sorting task instruction, a target keyword group and the same group of data in the caches.
And the cutting module 2 is used for carrying out primary block cutting processing on the seismic data according to a preset requirement to form a primary data block and carrying out secondary block cutting processing to form a secondary data block.
And the calculating module 3 is used for calculating the primary dicing processing and the secondary dicing processing according to the preset requirements.
The memory 4 is used for storing data information and computer program instructions.
A processor 5, which when executing the computer program instructions, implements: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
the preset seismic data are cut into blocks according to the data sorting task instruction to form primary data blocks, and the primary data blocks are sent to a computing node;
performing secondary segmentation on the primary data block according to the number of idle cores in the compute node and the number of pre-allocated trace gathers to form a secondary data block, and placing it in a cache;
and acquiring, according to the in-cache gather queues, the data of the same target keyword group from the caches to form trace gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array.
As shown in fig. 1-5, the present invention adopts the following technical solutions:
an efficient sorting method for a seismic big data cluster parallel machine comprises the following steps:
the acquisition module 1 acquires preset seismic data and a data sorting task instruction, and the processor 5 issues the data sorting task instruction according to the requirement;
after the calculation module 3 processes the preset seismic data according to the data sorting task instruction, the cutting module 2 performs primary block-cutting to form primary data blocks, which are sent to the compute nodes and stored in the memory 4;
after the calculation module 3 has evaluated the primary data block according to the number of idle cores inside the compute node and the number of pre-allocated trace gathers, the cutting module 2 performs secondary segmentation to form secondary data blocks, which are stored in the memory 4; several sorting threads are started, data are read into memory block by block, screened and rearranged trace by trace, and placed into the corresponding gather queues in the cache;
the acquisition module 1 acquires the same target keyword group data in a plurality of caches to form a gather, collects a plurality of gathers to form an output data block, and outputs the data to the designated position of the disk array according to the block.
The preferred embodiment of the present invention is shown in fig. 1-5:
an efficient sorting method for a seismic big data cluster parallel machine comprises the following steps: the acquisition module 1 acquires preset seismic data and a data sorting task instruction, stores the preset seismic data and the data sorting task instruction in the storage 4, and the processor 5 issues the data sorting task instruction according to requirements;
the calculation module 3 calculates and processes the preset seismic data according to data tasks and cluster configuration conditions, the cutting module 2 performs one-time block cutting processing to form a primary data block, the primary data block formed after data block cutting comprises data file blocks and sorting information blocks, the data file blocks and the matching sorting information blocks are distributed to a plurality of preset calculation nodes, and distributed calculation is performed.
The method comprises the following steps that an acquisition module 1 acquires a data sorting task instruction, a cutting module 2 cuts preset seismic data into blocks to form primary data blocks, and the primary data blocks are sent to a computing node, and comprises the following steps:
the method comprises the steps that a user presets N, an acquisition module 1 acquires calculation resources of all nodes in a cluster, FM is the percentage of the amount of idle memory, and FC is the number of idle calculation cores; when FM > 10% and FC > 10%, processor 5 selects the current node as an available sort node.
Sequentially screening nodes in the cluster until the number of obtained idle nodes is N; if all the nodes in the cluster are screened, and the calculating module 3 calculates the number of the obtained idle nodes N1< N, the value of N is reset to be N1.
The cutting module 2 performs data block-cutting according to the total number of files stored in the target data, the index information and the selected node condition; the primary data block formed after block-cutting comprises data-file blocks and sorting-information blocks.
And partitioning the data file into blocks and distributing the matched sorting information to a plurality of preset computing nodes for distributed computing.
In some embodiments, the forming of the data file chunks comprises the steps of:
the data file is cut into B blocks, where B = F/N;
F is the total number of files, and N is the number of idle nodes participating in the calculation;
when the value of B calculated by the calculation module 3 is greater than N, the number of data-file blocks is B; when the calculated value of B is less than N, each data sub-file is divided into M parts, where M = (N/F) + 1;
the total number of target files is then F1 = M × F;
and the data file is cut into B1 blocks, where B1 = M × F1/N.
In some embodiments, the sorting information block is formed by:
obtaining data information of a target keyword group gather existing in each group of data blocks through data indexing, wherein the sorting information blocks have the following structure:
{Block1, (key1, key2, ..., keyn), tcount}
meaning that the number of data traces in block Block1 that match the target keyword group (key1, key2, ..., keyn) is tcount;
and the sorting information blocks are respectively matched with the data file blocks.
In some embodiments, performing secondary segmentation on the primary data block, according to the number of idle cores inside the compute node and the number of trace gathers allocated to the data block, to form secondary data blocks and place them in the cache comprises:
acquiring the number of idle cores and taking 50% of that number as the applicable thread count T, where T is at least 1;
letting G be the number of gathers in the data block;
each thread then processes E = G/T gathers;
when E is less than 1, the value of T is changed to T = G, ensuring that every thread has at least one gather to compute, i.e. E = 1;
after the data segmentation is complete, T threads are started on the compute node and E gathers are assigned to each thread;
the data are rearranged and the rearranged data are placed in sequence into the pre-allocated caches, as shown in fig. 3.
In some embodiments, the step of each thread inputting data into the node block by block, rearranging the data trace by trace, and placing the rearranged data in sequence into the pre-allocated cache comprises:
the processor 5 inputs data into the node block by block, sorts the input data according to the target keyword group, and places each sorted trace, polling trace by trace, into the corresponding position of its trace cache queue; the cache is a set of in-memory queues created on the node from the sorting-information blocks and grouped into multiple queues by target keyword group. As shown in fig. 4, the data are grouped into queues according to the value of the target keyword group, tcount is the total number of traces of each keyword group in the current node's task, and ltr is the number of bytes occupied by each trace. A cache of tcount × ltr bytes is created for each target keyword group to store the sorting data of the current node.
Target data are read into memory in whole blocks in the default order, which improves the read-write utilization of the disk array.
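The cache-sizing rule above, a buffer of tcount × ltr bytes per target keyword group, can be sketched as follows. The record layout (dicts with 'keys' and 'tcount') is an illustrative assumption.

```python
def cache_sizes(sort_info, ltr):
    """Per-group cache size in bytes: tcount traces of ltr bytes each.
    sort_info: records with 'keys' and 'tcount' fields, as carried by
    the sorting-information blocks; ltr: bytes occupied by one trace."""
    return {rec["keys"]: rec["tcount"] * ltr for rec in sort_info}
```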
In some embodiments, acquiring the data of the same target keyword group from a plurality of caches according to the in-cache gather queues to form gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array comprises the steps of:
after the data of the same keyword group have been collected from the caches, they are output block by block to the designated position of the disk array; the collection process polls all target keyword groups (key1, key2, ..., keyn), gathers the qualifying data traces from every cache into one trace gather, collects several gathers into an output data block, and writes the block in one operation to the designated position of the disk.
Advantageous effects:
As shown in fig. 6, a 5 TB data set was used for a sorting test against conventional commercial software F. As the number of nodes increases, the advantage of the sorting algorithm disclosed in this patent becomes increasingly apparent, and cluster resources are used more effectively. The distributed parallel data sorting model disclosed in this patent supports efficient sorting of large-scale seismic data, applies computing resources rationally, improves sorting efficiency, and reduces the time cost of data processing; as parallelism increases, the computation time falls to roughly 50% of that of comparable software.

Claims (8)

1. An efficient sorting method for a parallel machine of a seismic big data cluster is characterized by comprising the following steps:
acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
the preset seismic data are cut into blocks according to the data sorting task instruction to form primary data blocks, and the primary data blocks are sent to the computing nodes;
performing secondary segmentation on the primary data block according to the number of idle cores in the compute node and the number of pre-allocated trace gathers to form a secondary data block, and placing it in a cache;
and acquiring, according to the in-cache gather queues, the data of the same target keyword group from the caches to form trace gathers, collecting the gathers into an output data block, and outputting the data block by block to the specified position of the disk array.
2. The efficient sorting method of the seismic big data cluster parallel machines according to claim 1,
the method comprises the following steps of obtaining a data sorting task instruction, carrying out block cutting processing on preset seismic data to form a primary data block, and sending the primary data block to a computing node, wherein the method comprises the following steps:
the method comprises the steps that a user presets the number N of sorted used nodes to obtain the computing resource condition of each node in a cluster, wherein FM is the percentage of the amount of idle memory, and FC is the number of idle computing cores; when FM is greater than 10% and FC is greater than 10%, selecting the current node as an available sorting node;
sequentially screening nodes in the cluster until the number of obtained idle nodes is N; if all nodes in the cluster are screened, and the number N1 of the obtained idle nodes is less than N, resetting the value of N to be N1;
performing data block cutting according to the total number of files stored in the target data, the index information and the selected node condition, wherein the primary data block formed after data block cutting comprises data file block cutting and sorting information block cutting;
and partitioning the data file into blocks and distributing the matched sorting information to a plurality of preset computing nodes for distributed computing.
3. The efficient sorting method for seismic big data cluster parallel machines according to claim 2, wherein
the data file blocks are formed by the following calculation steps:
the number of data file blocks is B, where B = F/N,
F is the total number of files,
and N is the number of idle nodes participating in the calculation;
when the calculated value of B is greater than N, the number of data file blocks is B; when the calculated value of B is less than N, each data subfile is divided into M parts, where
M = (N/F) + 1;
the total number of target files then becomes F1 = M × F,
and the number of data file blocks is B1 = F1/N.
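The block-count arithmetic of claim 3 can be followed step by step in a small sketch. Integer division is an assumption on my part; the claim does not state rounding behaviour, and the function name is illustrative.

```python
# Hedged sketch of claim 3: B = F/N; if B < N, each subfile is split
# into M = (N/F) + 1 parts, giving F1 = M * F files and B1 = F1/N
# blocks. Integer (floor) division is assumed throughout.
def block_count(f, n):
    """f: total number of files; n: idle nodes in the calculation."""
    b = f // n               # B = F / N
    if b > n:                # enough blocks already: use B directly
        return b
    m = n // f + 1           # M = (N / F) + 1
    f1 = m * f               # F1 = M * F  (new total file count)
    return f1 // n           # B1 = F1 / N
```

For example, 120 files on 10 nodes give B = 12 > 10, so B is used directly; 4 files on 10 nodes trigger the subfile split with M = 3 and F1 = 12.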
4. The efficient sorting method for seismic big data cluster parallel machines according to claim 3, wherein the sorting information blocks are formed by the following steps:
on the basis of the partitioning by data file, the information on the target key group gathers present in each group of data blocks is obtained through the data index, and the structure of a sorting information block is:
{Block1, (key1, key2, ..., keyn), tcount}
meaning that the number of data traces in data block Block1 matching the target key group (key1, key2, ..., keyn) is tcount;
and the sorting information blocks are matched one-to-one with the data file blocks.
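The {Block1, (key1, ..., keyn), tcount} record of claim 4 maps naturally onto a small immutable structure. The class and field names below are illustrative, not taken from the patent.

```python
# Minimal sketch of the sorting-information block in claim 4: it records
# that tcount traces in data block block_id match the target key group.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SortInfoBlock:
    block_id: str                  # e.g. "Block1"
    key_group: Tuple[str, ...]     # target key group (key1, ..., keyn)
    tcount: int                    # matching trace count in the block

# Example: 128 traces in Block1 match the (shot, offset) key group.
info = SortInfoBlock("Block1", ("shot", "offset"), 128)
```

One such record per data file block gives the one-to-one matching the claim describes.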
5. The efficient sorting method for seismic big data cluster parallel machines according to claim 1, wherein the step of performing secondary segmentation on the primary data block according to the number of idle cores within the computing node and the number of trace gathers allocated to the data block, to form secondary data blocks and place them in the cache, comprises the following steps:
the number of idle cores is obtained, and 50% of the number of idle cores is taken as the applicable thread count T, where T is at least 1;
the number of trace gathers in the data block is G;
the number of trace gathers computed by each thread is E = G/T;
when E is less than 1, the value of T is changed so that T = G, ensuring that each thread has at least one computable trace gather, i.e. E = 1;
after the data segmentation is completed, T threads are started on the computing node, E trace gathers are allocated to each thread, the data are rearranged, and after rearrangement the data are placed in sequence into the pre-allocated caches.
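The thread and gather arithmetic of claim 5 can be sketched as follows. Integer division and the function name are assumptions; the claim states only the ratios.

```python
# Sketch of claim 5: T is 50% of the idle cores (at least 1); each
# thread handles E = G/T trace gathers; if E < 1 (fewer gathers than
# threads), T is reduced to G so that every thread has exactly one.
def plan_threads(idle_cores, gather_count):
    t = max(1, idle_cores // 2)   # T = 50% of idle cores, T >= 1
    e = gather_count // t         # E = G / T
    if e < 1:                     # fewer gathers than threads
        t = gather_count          # set T = G ...
        e = 1                     # ... so that E = 1
    return t, e
```

With 8 idle cores and 20 gathers this yields 4 threads of 5 gathers each; with 8 cores but only 2 gathers, T collapses to 2 so no thread sits idle without a gather.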
6. The efficient sorting method for seismic big data cluster parallel machines according to claim 5, wherein the step of rearranging the data according to the input data blocks and placing the rearranged data in sequence into the pre-allocated caches comprises the following steps:
the target data are read into memory block by block and sorted according to the target key group, and each trace of the sorted data is placed, by polling trace by trace, into the position of the trace cache corresponding to its trace gather; the cache is a set of memory-based queues created on the node from the sorting information blocks, the queues being grouped into a plurality of queues according to the target key group.
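The rearrangement step above can be sketched as sorting trace headers by the target key group and distributing them into per-key-group queues. Field names such as "shot" are illustrative; the patent does not specify a header layout.

```python
# Hedged sketch of the rearrangement in claim 6: traces are sorted by
# the target key group, then polled trace by trace into in-memory
# queues (the node's cache), one queue per key-group value.
from collections import defaultdict

def rearrange(traces, key_fields):
    """traces: list of trace-header dicts; key_fields: target key group."""
    key_of = lambda tr: tuple(tr[f] for f in key_fields)
    ordered = sorted(traces, key=key_of)   # sort by the target key group
    queues = defaultdict(list)             # one queue per key-group value
    for tr in ordered:                     # poll trace by trace
        queues[key_of(tr)].append(tr)
    return queues

traces = [{"shot": 2, "trace": "b"}, {"shot": 1, "trace": "a"},
          {"shot": 1, "trace": "c"}]
queues = rearrange(traces, ("shot",))      # two queues: shot 1 and shot 2
```

Python's sort is stable, so traces sharing a key-group value keep their original relative order inside each queue.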
7. The efficient sorting method for seismic big data cluster parallel machines according to claim 1, wherein the step of acquiring data with the same target key group from the plurality of caches according to the in-cache gather queues to form trace gathers, assembling a plurality of trace gathers into an output data block, and writing the data block by block to the designated location in the disk array comprises the following steps:
the data of the same group are collected from the plurality of caches and then output to the disk array; the collection process polls all key groups, gathers the matching data traces from each cache into a trace-gather block, and then writes the block in a single pass to the designated location in the disk array.
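The collection step of claim 7 amounts to polling every key group across the node caches and concatenating the matching traces into one output block per group. The sketch below omits the actual disk-array write and uses illustrative names.

```python
# Sketch of claim 7's collection pass: for each target key group, the
# matching traces from every node cache are concatenated into a single
# trace-gather block, ready for one unified write to the disk array.
def collect_output_blocks(caches, key_groups):
    """caches: list of {key_group: [trace, ...]} dicts, one per node."""
    blocks = []
    for key in key_groups:               # poll all key groups
        gather = []
        for cache in caches:             # gather matching traces per cache
            gather.extend(cache.get(key, []))
        blocks.append((key, gather))     # one output block per key group
    return blocks

caches = [{(1,): ["t1"]}, {(1,): ["t2"], (2,): ["t3"]}]
blocks = collect_output_blocks(caches, [(1,), (2,)])
```

Traces "t1" and "t2" for key group (1,) come from different caches but end up in one block, matching the claim's "collect together, then output uniformly" behaviour.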
8. An efficient sorting device for seismic big data cluster parallel machines, characterized by comprising:
an acquisition module, configured to acquire the preset seismic data, the data sorting task instruction, the target key group, and the same group of data in the plurality of caches;
a cutting module, configured to perform primary block-cutting processing on the seismic data according to a preset requirement to form primary data blocks, and to perform secondary block-cutting processing to form secondary data blocks;
a calculation module, configured to perform the calculations for the primary block-cutting processing and the secondary block-cutting processing according to a preset requirement;
a memory, configured to store data information and computer program instructions;
and a processor which, when executing the computer program instructions, implements: acquiring preset seismic data, and issuing a data sorting task instruction according to requirements;
performing primary block-cutting processing on the preset seismic data according to the data sorting task instruction to form primary data blocks, and sending the primary data blocks to the computing nodes;
performing secondary segmentation on the primary data block according to the number of idle cores within the computing node and the number of pre-allocated trace gathers, to form secondary data blocks and place them in a cache;
and acquiring the same group of data from the plurality of caches to form trace gathers, and outputting the trace gathers respectively to the disk array.
CN202210127121.7A 2022-02-11 2022-02-11 Efficient sorting method and device for parallel machines of seismic big data clusters Pending CN114519129A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210127121.7A CN114519129A (en) 2022-02-11 2022-02-11 Efficient sorting method and device for parallel machines of seismic big data clusters


Publications (1)

Publication Number Publication Date
CN114519129A true CN114519129A (en) 2022-05-20

Family

ID=81595961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210127121.7A Pending CN114519129A (en) 2022-02-11 2022-02-11 Efficient sorting method and device for parallel machines of seismic big data clusters

Country Status (1)

Country Link
CN (1) CN114519129A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104459781A (en) * 2014-12-09 2015-03-25 中国石油天然气集团公司 Three-dimensional pre-stack seismic data random noise degeneration method
CN104570063A (en) * 2015-02-11 2015-04-29 安徽吉拓电子技术有限公司 Parallel extraction method of seismic channel set of seismic data
CN106250101A (en) * 2015-06-12 2016-12-21 中国石油化工股份有限公司 Migration before stack method for parallel processing based on MapReduce and device
CN109657197A (en) * 2017-10-10 2019-04-19 中国石油化工股份有限公司 A kind of pre-stack depth migration calculation method and system
CN110187383A (en) * 2019-05-27 2019-08-30 中海石油(中国)有限公司 A kind of quick method for separating of sea wide-azimuth seismic data COV trace gather
WO2020041928A1 (en) * 2018-08-27 2020-03-05 深圳市锐明技术股份有限公司 Data storage method and system and terminal device
CN111767264A (en) * 2019-04-02 2020-10-13 中国石油化工股份有限公司 Distributed storage method and data reading method based on geological information coding


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIU LANFENG et al.: "Database-based parallel trace-gather extraction technique", Oil Geophysical Prospecting (石油地球物理勘探), vol. 2015, no. 03, 15 June 2015 (2015-06-15) *
WEN BILONG et al.: "Efficiency-optimized design for distributed storage and access of seismic data", Computer & Digital Engineering (计算机与数字工程), vol. 2014, no. 08, 20 August 2014 (2014-08-20) *

Similar Documents

Publication Publication Date Title
CN110019396B (en) Data analysis system and method based on distributed multidimensional analysis
CN106681846B (en) Statistical method, device and system of log data
CN107122126B (en) Data migration method, device and system
CN106790529B (en) Dispatching method, control centre and the scheduling system of computing resource
CN101459901B (en) Vector map data transmission method based on multi-stage slicing mode
US8756309B2 (en) Resource information collecting device, resource information collecting method, program, and collection schedule generating device
US8321476B2 (en) Method and system for determining boundary values dynamically defining key value bounds of two or more disjoint subsets of sort run-based parallel processing of data from databases
WO2015024474A1 (en) Rapid calculation method for electric power reliability index based on multithread processing of cache data
CN111259933B (en) High-dimensional characteristic data classification method and system based on distributed parallel decision tree
CN113568813B (en) Mass network performance data acquisition method, device and system
CN114756629A (en) Multi-source heterogeneous data interaction analysis engine and method based on SQL
CN114519129A (en) Efficient sorting method and device for parallel machines of seismic big data clusters
CN114238360A (en) User behavior analysis system
CN107679133B (en) Mining method applicable to massive real-time PMU data
CN111078790B (en) Method and system for synchronizing isolated block data in block chain and storage medium
CN102521413A (en) Data reading device based on network reports and method
CN110716986B (en) Big data analysis system and application method thereof
CN112988736B (en) Mass data quality checking method and system
CN112069168B (en) Cloud storage method for equipment operation data
CN111737347B (en) Method and device for sequentially segmenting data on Spark platform
CN114416785A (en) Stream type enterprise big data processing method and storage medium
CN112148929A (en) Big data analysis method and device based on tree network
CN107784032A (en) Gradual output intent, the apparatus and system of a kind of data query result
CN113282568A (en) IOT big data real-time sequence flow analysis application technical method
JP2013101539A (en) Sampling device, sampling program, and method therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination