CN111931000A

CN111931000A - Large-scale vector field oriented data processing method

Info

Publication number: CN111931000A
Application number: CN202010807796.7A
Authority: CN
Inventors: 答海玲; 张柱; 郑坤; 冉秀桃
Original assignee: Wuhan Zhaotu Science & Technology Co ltd
Current assignee: Wuhan Zhaotu Science & Technology Co ltd
Priority date: 2020-08-12
Filing date: 2020-08-12
Publication date: 2020-11-13
Anticipated expiration: 2040-08-12
Also published as: CN111931000B

Abstract

The invention provides a data processing method for large-scale vector fields. The method equally divides large-scale vector field data into sub-regions and codes the sub-regions sequentially according to their positions; reads the sub-region data and merges it into data blocks; reads the data blocks and assigns each one to a partition through hash mapping on its code, so that spatially adjacent data blocks are assigned to the same partition; and uses a stream data valve to check the uniqueness and integrity of each data block in turn according to its code value. Because spatially adjacent data blocks are placed in the same partition, they do not need to be searched for again, which improves the efficiency of iterative computation. The stream data valve removes duplicate data blocks and data blocks with missing information, guaranteeing the uniqueness and integrity of the data stream.

Description

Large-scale vector field oriented data processing method

Technical Field

The invention relates to a data processing method for large-scale vector fields, and belongs to the field of high-performance computing frameworks.

Background

In the information age, sensor and information technologies are developing rapidly, vector field data and the demand for its application are growing quickly, and real-time computation on large-scale vector field data faces ever higher performance requirements. Take wind field data in meteorology as an example: the data is collected by widely distributed wind speed and direction sensors, aggregated synchronously at the meteorological departments of multiple regions, analyzed and computed in a unified cloud computing environment, and finally used to analyze wind field structure, issue typhoon warnings, and so on. Because of the huge data volume, however, meteorological departments still face minute-level delays when using the data. Similarly, in marine science, ocean current data can be used to analyze the influence of ocean currents on climate, and data processing faces the same performance problem.

Vector field data has a wide range of applications. For example, natural disasters such as typhoons and tsunamis can be predicted by analyzing wind field and ocean current data, and early warnings can be issued in time so that citizens can take precautions, protecting lives and reducing property loss. If the data is not analyzed and used promptly, its practical value falls accordingly, so research on high-performance computation for large-scale vector field data is of great practical significance.

Existing computation methods for large-scale vector field data divide the data into fine-grained data blocks and send those blocks at random to different nodes for computation. Vector field data, however, is spatially correlated: computing on one piece of vector field data requires the data spatially adjacent to it. Random distribution therefore greatly increases the amount of computation and adds a large amount of communication overhead.

Disclosure of Invention

To solve the technical problems in the prior art, the invention provides a high-performance computing framework for large-scale vector field data that reduces the large amount of computation caused by the spatial correlation of vector field data and improves the uniqueness and integrity of the output data stream.

The technical solution that achieves the object of the invention is a large-scale vector field data processing method comprising at least the following steps:

(1) equally dividing the large-scale vector field data, equally dividing each resulting piece again, and repeating until the number of divisions equals the dimension of the vector field data, the final division forming the sub-regions, which are coded sequentially according to their positions;

(2) setting a maximum merging number, reading all the sub-region data, and merging the sub-region data into data blocks sequentially, in ascending code order, according to the maximum merging number;

(3) coding the data blocks sequentially in order of formation, reading the data blocks, and assigning each data block to its corresponding partition through hash mapping on its code, so that spatially adjacent data blocks are assigned to the same partition;

the hash mapping formula is

$$A = \left\lfloor \frac{C}{M} \right\rfloor \bmod M$$

where A is the partition number, C is the code number of the data block, M is the total number of partitions, and $\lfloor \cdot \rfloor$ denotes rounding down;

(4) checking the uniqueness and integrity of the data blocks using a stream data valve, specifically:

1) setting up different buffers according to the code values of the data blocks;

2) assigning the data blocks to their corresponding buffers in turn according to their code values, and judging whether the data block to be assigned is the same as a data block already in the buffer; if so, replacing that data block in the buffer with the new data block, and if not, adding the new data block to the corresponding buffer;

3) judging whether the data in each data block in the buffer is complete; if so, outputting the data block to the data stream, and if not, not outputting the data block;

(5) performing iterative computation on the data blocks, that is, always computing with the result data of the previous computation, and combining the computed data blocks into a data stream for output.

The technical solution is further improved as follows: the equal division is a cross-shaped equal division.

In step (1), the sub-region data are coded 00, 01, 10 and 11 in order from left to right and from top to bottom; after a further equal division, the resulting sub-regions keep the original code and append a suffix assigned by the same rule.

In step 1), each buffer stores only data blocks that share the same code prefix.

In step 2), the data structure of a data block is: key value name, data type and key value.

In step 2), whether the data block to be assigned is the same as a data block in the buffer is judged as follows: checking whether the key values of the data block to be assigned and of the data block in the buffer are the same.

In step 3), whether the data in a data block in the buffer is complete is judged as follows: checking in turn whether every minimum unit area in the key value of the data block contains a data value; if so, the data in the data block is complete, and otherwise the data in the data block is incomplete.

As the above technical solution shows, the method provided by the invention equally divides the large-scale vector field data, equally divides each resulting piece again, and repeats until the number of divisions equals the dimension of the vector field data; the final division forms the sub-regions, which are coded sequentially according to their positions. This reduces the data scale of the large-scale vector field data and the complexity of data communication.

At the same time, all sub-region data within a specified range is read and merged into a data block, so the whole large-scale vector field data set does not need to be searched, which improves data transmission efficiency.

The method reads the data blocks and assigns them to their corresponding partitions through hash mapping on their codes. This assignment places spatially adjacent data blocks in the same partition, so the spatial neighbors of a data block do not need to be searched for again, which improves the efficiency of iterative computation.

The stream data valve assigns the data blocks to their corresponding buffers in turn according to their code values and judges whether the data block to be assigned is the same as a data block already in the buffer; if so, it replaces that data block in the buffer with the new one, and if not, it adds the new data block to the corresponding buffer. It then judges whether the data in each data block in the buffer is complete; if so, it outputs the data block to the data stream, and if not, it does not output the data block. The stream data valve thus removes duplicate data blocks and data blocks with missing information, outputs only unique and complete data blocks to the data stream, and guarantees the uniqueness and integrity of the data stream.

Drawings

FIG. 1 is a schematic diagram of data partitioning and encoding according to the present invention;

FIG. 2 is a schematic diagram of node allocation according to the present invention;

FIG. 3 is a block output flow diagram according to the present invention;

Detailed Description

The present invention will be described in detail with reference to the accompanying drawings and examples, and the present invention is not limited to the examples.

Referring to FIG. 1, the invention provides a large-scale vector field data processing method comprising the following steps:

(1) equally dividing the large-scale vector field data, equally dividing each resulting piece again, and repeating until the number of divisions equals the dimension of the vector field data, the final division forming the sub-regions, which are coded sequentially according to their positions. The specific division rule in this embodiment is: divide the large-scale vector field data equally in a cross shape, divide each resulting piece again in the same way, and repeat the division step until the number of divisions reaches the data dimension.

The specific coding rule in this embodiment is: the sub-region data are coded 00, 01, 10 and 11 in order from left to right and from top to bottom; after a further division, the resulting sub-regions keep the original code and append a suffix assigned by the same rule. For example, the codes after the first division are 00, 01, 10 and 11, and the codes after the second division are 0000, 0001, 0010, 0011, 0100, 0101, and so on.

Therefore, the data scale of large-scale vector field data is reduced, and the complexity of data communication is reduced;
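The coding rule can be illustrated with a minimal sketch. It assumes a two-dimensional field laid out as a square grid with 2^depth cells per side, indexed by row (top to bottom) and column (left to right); the function name `encode_subregion` and this indexing are hypothetical and are not taken from the embodiment.

```python
def encode_subregion(row: int, col: int, depth: int) -> str:
    """Return the code of the cell at (row, col) after `depth` cross-shaped divisions.

    Each division contributes two bits: the first selects top (0) or bottom (1),
    the second selects left (0) or right (1). This yields 00, 01, 10, 11 in
    left-to-right, top-to-bottom order, and deeper divisions append a suffix.
    """
    size = 1 << depth  # number of cells per side after `depth` divisions
    if not (0 <= row < size and 0 <= col < size):
        raise ValueError("cell lies outside the grid")
    parts = []
    # Walk from the coarsest division to the finest one.
    for level in range(depth - 1, -1, -1):
        row_bit = (row >> level) & 1
        col_bit = (col >> level) & 1
        parts.append(f"{row_bit}{col_bit}")
    return "".join(parts)


if __name__ == "__main__":
    # Codes of the four cells inside the top-left quadrant after two divisions.
    print([encode_subregion(r, c, 2) for r in range(2) for c in range(2)])
```

With this layout, every cell inside the quadrant coded 00 receives a code beginning with 00, so a shared prefix indicates spatial proximity.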

(2) setting a maximum merging number that is smaller than the number of sub-regions, reading all the sub-region data, and merging the sub-region data into data blocks sequentially, in ascending code order, according to the maximum merging number. Reading starts from a specified position; a run of consecutive sub-region data is then read from storage according to the codes, merged into a data block and sent out. The whole large-scale vector field data set therefore does not need to be searched, which improves data transmission efficiency and lets the data enter the data stream efficiently.
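The merging step might be sketched as follows. The sketch assumes the sub-regions are held in a dictionary keyed by fixed-width binary code strings and that each data block simply takes the next `max_merge` sub-regions in ascending code order; the function name `merge_subregions` and the block layout are hypothetical.

```python
def merge_subregions(subregions: dict, max_merge: int) -> list:
    """Group sub-regions, in ascending code order, into blocks of at most `max_merge`."""
    if max_merge < 1:
        raise ValueError("max_merge must be at least 1")
    # Fixed-width binary codes sort correctly as plain strings.
    ordered = sorted(subregions)
    blocks = []
    for start in range(0, len(ordered), max_merge):
        codes = ordered[start:start + max_merge]
        blocks.append({
            # Blocks are numbered in the order in which they are formed.
            "block_code": len(blocks),
            "subregions": {code: subregions[code] for code in codes},
        })
    return blocks


if __name__ == "__main__":
    # Sixteen sub-regions (two divisions of a 2-D field) with placeholder payloads.
    data = {format(i, "04b"): [float(i)] for i in range(16)}
    for block in merge_subregions(data, max_merge=4):
        print(block["block_code"], list(block["subregions"]))
```

Because consecutive codes share a prefix, a block built from a consecutive run of codes covers sub-regions that lie close together in space.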

(3) Referring to FIG. 2, coding the data blocks sequentially in order of formation, reading the data blocks, and assigning each data block to its corresponding partition through hash mapping on its code, so that spatially adjacent data blocks are assigned to the same partition;

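the hash mapping formula is

$$A = \left\lfloor \frac{C}{M} \right\rfloor \bmod M$$

where A is the partition number, C is the code number of the data block, M is the total number of partitions, and $\lfloor \cdot \rfloor$ denotes rounding down;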

For example, suppose there are 10 data blocks coded 0 to 9 and the computing cluster has three partitions numbered 0 to 2. Under the above mapping, partition 0 receives data blocks (0, 1, 2, 9), partition 1 receives data blocks (3, 4, 5), and partition 2 receives data blocks (6, 7, 8).

Because vector field data is spatially correlated, spatially adjacent data blocks are assigned to the same partition, so the neighbors of a data block do not need to be searched for again, which improves the efficiency of iterative computation;
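The worked example can be checked with a short sketch. It assumes the mapping has the form A = floor(C / M) mod M, which is consistent with the definitions of A, C and M, with the rounding-down note, and with the partition contents listed in the example; the function names are hypothetical.

```python
def partition_of(block_code: int, total_partitions: int) -> int:
    """Assumed mapping: A = floor(C / M) mod M."""
    return (block_code // total_partitions) % total_partitions


def assign_blocks(block_codes, total_partitions: int) -> dict:
    """Return a dict mapping each partition number to its list of block codes."""
    partitions = {p: [] for p in range(total_partitions)}
    for code in block_codes:
        partitions[partition_of(code, total_partitions)].append(code)
    return partitions


if __name__ == "__main__":
    # Ten blocks coded 0..9 and three partitions, as in the example.
    print(assign_blocks(range(10), 3))
```

Under this form, runs of M consecutive block codes go to the same partition, and the partition number wraps around after M such runs, which is why block 9 returns to partition 0.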

(4) checking the uniqueness and integrity of the data blocks using a stream data valve, specifically:

1) setting up different buffers according to the code values of the data blocks, where each buffer stores only data blocks that share the same code prefix;

2) referring to FIG. 3, assigning the data blocks to their corresponding buffers in turn according to their code values, and judging whether the data block to be assigned is the same as a data block already in the buffer; if so, replacing that data block in the buffer with the new data block, and if not, adding the new data block to the corresponding buffer. In this embodiment, the data structure of a data block is key value name, data type and key value, and two data blocks are judged to be the same when the key value of the data block to be assigned equals the key value of the data block in the buffer.

3) judging whether the data in each data block in the buffer is complete; if so, outputting the data block, and if not, not outputting the data block. In this embodiment, completeness is judged by checking in turn whether every minimum unit area in the key value of the data block contains a data value; if so, the data in the data block is complete, and otherwise the data in the data block is incomplete.

Step 2) judges the uniqueness of the data block, and step 3) judges its integrity.

The stream data valve removes duplicate data blocks and data blocks with missing information, outputs only unique and complete data blocks to the data stream, and guarantees the uniqueness and integrity of the data stream.
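A minimal sketch of such a stream data valve follows. It assumes that a block's key value is a dictionary from minimum unit area to data value, with `None` marking a missing value, and that buffers are keyed by a fixed-length code prefix; the class names, the `prefix_len` parameter and this representation are hypothetical and are not specified by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class DataBlock:
    code: str         # block code, e.g. "0001"
    key_name: str     # key value name
    data_type: str    # data type
    key_value: dict   # minimum unit area -> data value (None if missing)


class StreamDataValve:
    def __init__(self, prefix_len: int = 2):
        self.prefix_len = prefix_len
        self.buffers = {}  # code prefix -> list of DataBlock

    def push(self, block: DataBlock) -> None:
        """Place a block in the buffer for its prefix, replacing a duplicate if present."""
        buffer = self.buffers.setdefault(block.code[:self.prefix_len], [])
        for i, existing in enumerate(buffer):
            # Uniqueness check: blocks with equal key values count as the same block.
            if existing.key_value == block.key_value:
                buffer[i] = block
                return
        buffer.append(block)

    @staticmethod
    def is_complete(block: DataBlock) -> bool:
        """Integrity check: every minimum unit area must hold a data value."""
        values = block.key_value.values()
        return bool(block.key_value) and all(v is not None for v in values)

    def complete_blocks(self) -> list:
        """Return the buffered blocks that pass the integrity check, in buffer order."""
        return [b for buf in self.buffers.values() for b in buf if self.is_complete(b)]


if __name__ == "__main__":
    valve = StreamDataValve()
    valve.push(DataBlock("0001", "wind", "float", {"u0": 1.0, "u1": 2.0}))
    valve.push(DataBlock("0001", "wind", "float", {"u0": 1.0, "u1": 2.0}))  # duplicate
    valve.push(DataBlock("0010", "wind", "float", {"u0": 3.0, "u1": None}))  # incomplete
    print([b.code for b in valve.complete_blocks()])
```

In this sketch incomplete blocks stay in their buffer and are simply withheld from the output; whether they are later completed or discarded is left open.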

(5) performing iterative computation on the data blocks, that is, always computing with the result data of the previous computation, and combining the computed data blocks into a data stream for output. The iterative computation ensures that data is visible across different data blocks, which makes the computation result more accurate.
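The iteration pattern, in which each round reads only the results of the previous round, might be sketched as below. The sketch reduces each data block to a single number and uses a neighbor average as a stand-in update rule; both simplifications, and the function names, are hypothetical and are not the computation used in the embodiment.

```python
def iterate_partition(blocks: dict, update, rounds: int) -> dict:
    """Run `rounds` iterations; each round reads only the previous round's results."""
    previous = dict(blocks)
    for _ in range(rounds):
        # Build the new results from `previous` without modifying it in place,
        # so every block sees the same snapshot of the last computation.
        previous = {code: update(code, previous) for code in previous}
    return previous


def neighbour_mean(code: int, previous: dict) -> float:
    """Placeholder update: average a block with its neighbors in the same partition."""
    values = [previous[c] for c in (code - 1, code, code + 1) if c in previous]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Three spatially adjacent blocks held in one partition.
    print(iterate_partition({0: 0.0, 1: 3.0, 2: 6.0}, neighbour_mean, rounds=1))
```

Because adjacent blocks sit in the same partition, the update can read its neighbors directly from the local snapshot without fetching them from another node.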

Claims

1. A large-scale-oriented vector field data processing method is characterized by at least comprising the following steps:

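the hash mapping formula is

$$A = \left\lfloor \frac{C}{M} \right\rfloor \bmod M$$

where A is the partition number, C is the code number of the data block, M is the total number of partitions, and $\lfloor \cdot \rfloor$ denotes rounding down;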

2. The large-scale-oriented vector field data processing method of claim 1, wherein: the equal division in step (1) is a cross-shaped equal division.

3. The large-scale-oriented vector field data processing method of claim 1, wherein: in step (1), the sub-region data are coded 00, 01, 10 and 11 in order from left to right and from top to bottom, and after a further equal division the resulting sub-regions keep the original code and append a suffix assigned by the same rule.

4. The large-scale-oriented vector field data processing method of claim 1, wherein: in step 1), each buffer stores only data blocks that share the same code prefix.

5. The large-scale-oriented vector field data processing method of claim 1, wherein: in step 2), the data structure of a data block is key value name, data type and key value.

6. The large-scale-oriented vector field data processing method according to claim 1 or 5, wherein whether the data block to be assigned in step 2) is the same as a data block in the buffer is judged as follows: checking whether the key values of the data block to be assigned and of the data block in the buffer are the same.

7. The large-scale-oriented vector field data processing method according to claim 1 or 5, wherein whether the data in a data block in the buffer is complete in step 3) is judged as follows: checking in turn whether every minimum unit area in the key value of the data block contains a data value; if so, the data in the data block is complete, and otherwise the data in the data block is incomplete.