CN113641497B

CN113641497B - System for realizing distributed high concurrency data summarization based on dimension reduction segmentation technology

Info

Publication number: CN113641497B
Application number: CN202110935417.7A
Authority: CN
Inventors: 周迅; 闫明明; 蔡超
Original assignee: Beijing Sanyi Sichuan Technology Co ltd
Current assignee: Beijing Sanyi Sichuan Technology Co ltd
Priority date: 2021-08-03
Filing date: 2021-08-16
Publication date: 2024-07-05
Anticipated expiration: 2041-08-16
Also published as: CN113641497A

Abstract

The high concurrency data summarizing system comprises a data processing system and a computing node system. The data is sliced along dimensions such as time and space, and the slices are processed concurrently, so that, with many data slices, overall summarization efficiency improves greatly as computing node servers are added. After a node server has summarized all of its local data, the data is reduced in dimension. The method greatly improves data processing efficiency, improves network utilization, reduces network pressure, and lowers the hardware requirements on servers.

Description

System for realizing distributed high concurrency data summarization based on dimension reduction segmentation technology

Technical Field

The invention relates to the technical field of atmospheric environment and computer processing, in particular to a system for realizing distributed high concurrency data summarization based on a dimension-reduction segmentation technology.

Background

In distributed computation of complex CALPUFF models, a large amount of model result data is generated. The result data of a CALPUFF calculation is multidimensional: a single CALPUFF case spans a time dimension, a coordinate dimension, and a pollutant dimension.

In general, the time period of one CALPUFF case is about one week, but it can reach one year in certain scenarios. The coordinate dimension is the number of model grid cells, typically ten thousand to tens of thousands, and up to millions in special scenarios. The pollutant dimension is the number of pollutants, typically 5, and up to hundreds in special scenarios. Because the data size grows linearly with model complexity and simulated time, it can range from a few hundred MB to hundreds of GB. This places high demands on the summarization and statistical analysis of model results, especially on transmitting and summarizing several hundred GB of data in a distributed environment.
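The linear growth described above can be illustrated with a rough size estimate. This is only a sketch: the hourly time step, the 4 bytes per stored value, and the function name `estimate_bytes` are assumptions for illustration and are not stated in the source; real CALPUFF output files carry additional overhead.

```python
def estimate_bytes(time_steps, grid_cells, pollutants, bytes_per_value=4):
    """Raw payload size: one value per (time step, grid cell, pollutant).

    bytes_per_value=4 is an assumed single-precision float, not a documented
    property of the model output.
    """
    return time_steps * grid_cells * pollutants * bytes_per_value


# One week of assumed hourly steps, 10,000 grid cells, 5 pollutants.
week_case = estimate_bytes(7 * 24, 10_000, 5)

# One year of assumed hourly steps, 1,000,000 grid cells, 5 pollutants.
year_case = estimate_bytes(365 * 24, 1_000_000, 5)

print(week_case / 1e6, "MB")   # 33.6 MB
print(year_case / 1e9, "GB")   # 175.2 GB
```

Under these assumptions, a longer simulated period and a larger grid are already enough to move a case from tens of MB to the hundreds-of-GB range.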

In the prior art, data can be transmitted through various protocols, each suited to different scenarios, and some, such as TCP and FTP, are suitable for large data transfers. Although these protocols solve the single problem of large data transmission, distributed computation of complex CALPUFF models must also consider IO performance, computing performance, memory use, and so on, and must weigh hardware cost against computing efficiency. A single solution therefore cannot effectively solve all of these problems. The present solution addresses these problems together and provides an approach suited to this usage scenario.

Disclosure of Invention

The invention provides a system for realizing distributed high concurrency data summarization based on a dimension-reduction segmentation technology. It aims to solve the problem that the prior art, which relies on a communication protocol for data transmission, cannot jointly address IO, calculation efficiency, memory use, and similar concerns in a distributed complex CALPUFF model calculation scenario.

The system for realizing distributed high concurrency data summarization based on the dimension reduction segmentation technology comprises a data processing system and a computing node system;

The computing node system consists of a plurality of computing node servers, wherein the CALPUFF model computing cases are randomly distributed to the corresponding computing node servers, and each computing node server comprises a WRF computing model computing module, a CALMET model computing module and a CALPUFF model computing module; each computing node server performs slicing according to the result of model calculation, then performs local summarization, and then transmits the data which are locally summarized by each computing node server to a data processing system for final summarization;

the data processing system comprises a data summarizing module, a data analysis module, a data transfer module and a picture rendering module;

the data summarizing module is used for summarizing the local summarized data transmitted by each computing node server in real time;

The data analysis module is used for carrying out secondary data analysis on the final calculation result of the CALMET and CALPUFF cases;

the data transfer module completes the persistent storage of various data; the data to be stored in a persistent manner comprises result data calculated by the CALMET and CALPUFF models and data subjected to secondary statistical analysis;

the picture rendering module renders the data into a picture in a specified format; the data content to be rendered includes the final calculated result data of the CALMET and CALPUFF cases.

The invention has the beneficial effects that:

1. The data processing efficiency is high. The invention adopts data slicing and hierarchical summarization, which greatly improves the efficiency of data summarization. The number of data slices is determined by the complexity of the CALPUFF model case; for a fixed number of slices, the more computing node servers there are, the less time the primary summarization of the slices takes. The time for primary summarization can therefore be greatly reduced by adding computing node servers. Secondary summarization depends on the concurrency setting of the data processing server. Because the data processing service is a single server, its concurrency cannot be increased without limit, but in general it can be set to 70% of the number of CPUs of the server; for example, a server with 20 CPUs can run 14 secondary summarization threads at the same time. Efficiency is therefore greatly improved compared with the original approach. Combining the concurrency of the first and second stages greatly reduces the time needed to analyze and process the results of a complex CALPUFF model case.
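The 70% rule above can be sketched as follows. The helper names `secondary_workers` and `summarize_all` are hypothetical, and the use of a Python thread pool is an assumption made for illustration; the source does not say how the data processing server implements its concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def secondary_workers(cpu_count, percent=70):
    """Concurrency for secondary summarization as a percentage of CPUs.

    Integer arithmetic avoids floating-point rounding; at least one worker
    is always returned.
    """
    return max(1, cpu_count * percent // 100)


def summarize_all(node_payloads, summarize_one, cpu_count):
    """Apply summarize_one to every payload using a bounded thread pool."""
    workers = secondary_workers(cpu_count)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of node_payloads in its results.
        return list(pool.map(summarize_one, node_payloads))


print(secondary_workers(20))                       # 14
print(summarize_all([[1, 2], [3, 4, 5]], sum, 20))  # [3, 12]
```

In CPython, threads mainly help when the per-payload work is IO-bound or releases the interpreter lock; a process pool would be the usual substitute for CPU-bound summarization.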

2. The data transmission efficiency is greatly improved. The results of a complex CALPUFF model can reach hundreds of GB, so transmitting them over the network takes a great deal of time and easily causes network congestion, which affects other systems on the network. Data dimension reduction greatly reduces the size of the data.

3. IO is greatly reduced. Because the invention adopts data slicing, each data slice is small and can be loaded into memory directly in one pass, instead of decomposing the CALPUFF results into many small files through CALPOST and then parsing those files one by one.

4. The memory peak is greatly reduced. Because the scheme adopts data slicing, during primary summarization on each computing node an entire data slice can be read into memory, processed quickly, and then released, so the amount of memory used by the system does not grow with the number of slices. Each data slice is small, so the system's memory requirement is greatly reduced.
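The bounded memory use described in point 4 can be sketched as a fold over slices that keeps only one slice and one running total in memory at a time. The function name `primary_summary` is hypothetical, and element-wise summation is an assumed summarization operation; the source does not specify what the primary summary computes.

```python
def primary_summary(slices):
    """Fold an iterable of equally sized slices into one running total.

    `slices` may be a generator that loads each slice on demand, so only the
    current slice and the running total are held in memory at once.
    """
    total = None
    for values in slices:
        if total is None:
            total = list(values)          # first slice seeds the total
            continue
        if len(values) != len(total):
            raise ValueError("slice length does not match running total")
        for i, v in enumerate(values):
            total[i] += v                 # assumed operation: element-wise sum
        # `values` goes out of scope on the next iteration and can be freed
    return total


def load_slices():
    """Stand-in for reading slices from disk one at a time."""
    yield [1.0, 2.0, 3.0]
    yield [10.0, 20.0, 30.0]
    yield [100.0, 200.0, 300.0]


print(primary_summary(load_slices()))  # [111.0, 222.0, 333.0]
```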

5. The method of the invention greatly improves the data processing efficiency, improves the network utilization efficiency, reduces the network pressure, and simultaneously reduces the requirement on the hardware of the server.

Drawings

Fig. 1 is a schematic diagram of a system for implementing distributed high concurrency data aggregation based on a dimension reduction segmentation technique according to the present invention.

Detailed Description

With reference to fig. 1, this embodiment describes the system for implementing distributed high concurrency data summarization based on the dimension reduction segmentation technology, which consists of a computing node system and a data processing system. The calculation results of a complex CALPUFF model are sliced and distributed across several computing node servers. The computing node servers first perform primary summarization on the slice data of the sub-cases and then transmit the data to the data processing server over the network. The data processing server performs secondary summarization on the primarily summarized data to obtain the final result and then carries out subsequent processing.
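The two-stage flow can be sketched in a single process as follows. This is a simplified illustration, not the patented implementation: all function names are hypothetical, slicing is done along the time dimension only, and summation stands in for whatever summarization the real system performs.

```python
import random


def split_slices(series, slice_len):
    """Cut a time-ordered list of values into fixed-length slices."""
    return [series[i:i + slice_len] for i in range(0, len(series), slice_len)]


def assign_randomly(slices, node_count, seed=None):
    """Randomly scatter slices over node indices 0..node_count-1."""
    rng = random.Random(seed)
    nodes = [[] for _ in range(node_count)]
    for s in slices:
        nodes[rng.randrange(node_count)].append(s)
    return nodes


def primary(node_slices):
    """Primary summary on one node: total over every slice it holds."""
    return sum(sum(s) for s in node_slices)


def secondary(primary_results):
    """Secondary summary on the data processing server."""
    return sum(primary_results)


hourly = [float(h) for h in range(168)]           # one week, assumed hourly
nodes = assign_randomly(split_slices(hourly, 24), node_count=3, seed=0)
total = secondary(primary(n) for n in nodes)
print(total == sum(hourly))                       # True
```

Because the assumed summary is associative, the result does not depend on which node received which slice; only the small per-node totals cross the network.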

The computing node system comprises a plurality of computing node servers and is used for integrating complex computing flows, completing computing tasks of all links, locally summarizing computing results and then sending the computing results to the data processing servers for final summarizing. Each computing node server comprises a WRF computing model computing module, a CALMET model computing module and a CALPUFF model computing module. The respective modules will be described below.

The WRF (Weather Research and Forecasting) model is a new-generation mesoscale forecast model and assimilation system developed jointly by scientists at US research institutions and universities. WRF is a fully compressible, non-hydrostatic model that uses the Arakawa C grid. It combines advanced numerical methods and data assimilation techniques, adopts improved physical parameterization schemes, and supports multiple nesting and easy relocation to different geographic positions. It integrates numerical weather prediction, atmospheric simulation, and data assimilation into one modeling system and improves the simulation and forecasting of mesoscale weather at scales from meters to thousands of kilometers.

The WRF Preprocessing System (WPS) consists of three programs whose function is to prepare the input fields for real-data simulation: geogrid defines the model domain (including longitude and latitude range, center point coordinates, grid nesting, number of grid points in the horizontal direction, and resolution) and interpolates static terrestrial data (including terrain elevation, land use type, vegetation coverage, soil category, and so on) to the grid points; ungrib extracts meteorological fields from GRIB format data; metgrid horizontally interpolates the extracted meteorological fields to the grid points defined by geogrid.

The WRF-ARW/NMM system is the core module of the WRF model and has advantages such as high efficiency, ease of use, and support for parallel execution. Before the WRF program is run, the data must be processed by the real program: real reads the meteorological fields interpolated by metgrid, interpolates them vertically onto the eta levels of the WRF model, generates the required boundary files, and initializes the boundary conditions.

The WRF model data comprises meteorological data and basic geographic data.

The meteorological data are the FNL reanalysis data used to simulate historical weather. They are global gridded data produced jointly by the National Centers for Environmental Prediction (NCEP) and the National Center for Atmospheric Research (NCAR) and published on the NCEP official website. A state-of-the-art global data assimilation system and a comprehensive database are used to quality-control and assimilate observations from many sources (surface stations, ships, radiosondes, satellites, and so on), yielding a complete set of analysis data. The data are characterized by frequent analysis times, high density, strong continuity, high resolution, and rich content, and can effectively compensate for the shortcomings of conventional observations in the analysis of severe weather.

The FNL data has a spatial resolution of 1° × 1° and a temporal resolution of 6 hours, with a global analysis at 00, 06, 12, and 18 UTC each day. The data include air pressure, temperature, relative humidity, rainfall, and so on. The data come in two formats, GRIB1 and GRIB2: GRIB1 data cover 1999.07.30 to 2007.12.06, and GRIB2 data start at 2007.12.06 and are continuously updated. The meteorological data provided here are in GRIB2 format.

The basic geographic data of the WRF model include terrain elevation, land use type, and other underlying surface information. The terrain data is GTOPO30 with a resolution of 30″. The land use type data are USGS and MODIS: USGS contains 24 land types and MODIS contains 20 land types; these satellite land cover products have a resolution of up to 30″. Other underlying surface data include vegetation type, soil moisture, soil texture, and so on.

The WRF model run configuration comprises basic data input, control file settings, and physical parameter settings. The basic data include terrain and land use type data, weather forecast grid data (GFS), historical weather reanalysis data (FNL), and surface and upper-air observation data. The simulation settings include basic items such as nested grid range, map projection, and simulation time. The WRF model provides a variety of physical parameterization schemes, and selecting appropriate schemes improves the simulation and forecasting of mesoscale weather.

The CALMET model is a diagnostic wind field model developed by Sigma Research Corporation (now a subsidiary of Earth Tech, Inc.) and recommended by the US EPA. CALMET is a meteorological module that uses the mass conservation continuity equation to describe hourly wind and temperature fields in a three-dimensional gridded modeling domain. Its core consists of a diagnostic wind field module and a micrometeorological module. The diagnostic wind field module adjusts an initial-guess wind field (mesoscale model output, or routinely monitored surface and upper-air meteorological data) for terrain kinematic effects, slope flows, and terrain blocking to produce a first-step wind field; observations are then introduced into the first-step wind field, and the final wind field is produced through interpolation, smoothing, vertical velocity calculation, divergence minimization, and so on. In three-dimensional wind field simulation, CALMET accounts in detail for the kinematic effects of terrain, slope flows, and blocking effects.

The CALPUFF model is an air quality dispersion model recommended by the United States Environmental Protection Agency (EPA) and developed by Sigma Research Corporation (now a subsidiary of Earth Tech, Inc.). It consists of the CALMET meteorological module, the CALPUFF puff dispersion module, and the CALPOST post-processing module. It is a non-steady-state, multi-layer, multi-species Gaussian puff dispersion model (for pollutants such as SO₂ and NOₓ) that accounts for transport and dispersion, dry and wet deposition, and basic chemical transformation of different pollutants under meteorological conditions that vary in time and space. It takes into account complex terrain, overwater transport, coastal boundary effects, and building downwash, and simulates pollutants emitted from sources by advection and dispersion to estimate concentration and deposition at preset points.

CALPUFF model computation is divided into two main parts: the CALMET meteorological processing module and the CALPUFF puff dispersion module. The CALMET module generates the meteorological field files required by the main CALPUFF module. The CALPUFF module is the main module of the model: a multi-layer Gaussian dispersion model for simulating or predicting multiple pollutants under non-steady-state conditions. It performs the concentration dispersion calculation, but requires emission source data as external input, and yields the mass concentration distribution of pollutants (such as SO₂ and NOₓ) under meteorological conditions that vary with time and location.

The data processing system is a data processing server whose main purpose is to process and analyze the calculation results. Model calculation produces a large number of data files, and the results of a complex case often contain billions of data points. The data processing system must analyze these results efficiently in real time and transfer the data to a data persistence layer. The result data then undergoes secondary analysis and is rendered into pictures, among other functions. The system comprises a data summarizing module, a data analysis module, a data transfer module, and a picture rendering module.

The main function of the data summarizing module is to summarize data. A complex case produces a large number of sub-cases, which are randomly assigned to the model computing systems of the various nodes. After all distributed calculation is completed, the model computing system of each node performs local summarization and then sends its result to the data processing system for final summarization. The main task of the data summarizing module is therefore to accept the locally summarized data from the model computing system of each computing node and to summarize the received data in real time.
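A minimal sketch of such a real-time summarizer is shown below. The class name `RunningSummary`, the per-cell summation, and the rule of ignoring a second delivery from the same node are all assumptions made for illustration; the source only states that received data is summarized as it arrives.

```python
import threading


class RunningSummary:
    """Accumulates per-cell totals as locally summarized payloads arrive."""

    def __init__(self, cell_count):
        self._totals = [0.0] * cell_count
        self._received = set()
        self._lock = threading.Lock()   # payloads may arrive on several threads

    def receive(self, node_id, values):
        """Merge one node's payload; return False if it was already merged."""
        if len(values) != len(self._totals):
            raise ValueError("payload length does not match cell count")
        with self._lock:
            if node_id in self._received:
                return False            # assumed policy: ignore duplicates
            for i, v in enumerate(values):
                self._totals[i] += v
            self._received.add(node_id)
            return True

    def snapshot(self):
        """Return a copy of the current totals."""
        with self._lock:
            return list(self._totals)


summary = RunningSummary(cell_count=2)
summary.receive("node-1", [1.0, 2.0])
summary.receive("node-2", [3.0, 4.0])
print(summary.snapshot())  # [4.0, 6.0]
```

Merging each payload on arrival means the server never has to hold all node results at once, which fits the real-time behavior the module description calls for.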

The data analysis module is mainly used for carrying out secondary data analysis on the final calculation results of the CALMET and CALPUFF cases, such as pollution source contribution rate, maximum ground-level concentration, and statistical analysis of wind speed and direction.
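Two of these statistics can be sketched as follows. The definitions used here are assumptions: the contribution rate is taken as a source's share of the total concentration over all grid cells, and the maximum ground-level concentration as the highest combined value over all cells. The function names and the input layout (source name mapped to per-cell concentrations) are hypothetical.

```python
def contribution_rates(per_source):
    """Share of the total concentration attributable to each source."""
    totals = {name: sum(values) for name, values in per_source.items()}
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name in totals}
    return {name: t / grand_total for name, t in totals.items()}


def max_ground_level(per_source):
    """Return (cell_index, value) of the highest combined concentration."""
    combined = [sum(cell) for cell in zip(*per_source.values())]
    index = max(range(len(combined)), key=combined.__getitem__)
    return index, combined[index]


# Hypothetical concentrations at two grid cells from two sources.
sample = {"source_a": [1.0, 5.0], "source_b": [1.0, 1.0]}
print(contribution_rates(sample))  # {'source_a': 0.75, 'source_b': 0.25}
print(max_ground_level(sample))    # (1, 6.0)
```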

The main function of the data transfer module is to complete the persistent storage of various data. The data to be persisted includes the result data of the CALMET and CALPUFF model calculations, as well as the data of secondary statistical analysis.

The main function of the picture rendering module is to render data into pictures in a specified format. The data content to be rendered includes the final calculated result data of the CALMET and CALPUFF cases.

In the high concurrency data summarizing method, CALPUFF model cases are randomly scattered across different computing node servers. Each node performs local summarization on its model results and, once the summarization on a node is finished, sends its data to the data processing server for final summarization. Data dimension reduction means converting the original multidimensional data structure into one-dimensional data according to a specific encoding, keeping only the final pollutant concentration value of each grid cell and removing the qualifier values of the other dimensions.

In this embodiment, data is processed concurrently by slicing. The data is sliced along dimensions such as time and space, and the slices are processed concurrently, so that, with many data slices, overall summarization efficiency improves greatly as computing node servers are added. After a node server has summarized all of its local data, the data is reduced in dimension: following the agreed format for the data stream of each dimension, the data streams of all dimensions are spliced into a single one-dimensional data stream, which is then transmitted to the data processing server. When the data processing server receives the one-dimensional data stream, it restores the data to multidimensional form using the agreed data format. The data processing server receives one-dimensional data from multiple computing node servers and performs data conversion and secondary summarization in real time each time the data of one computing node arrives, until the data of all computing node servers has been summarized. After all data has been summarized, the data processing server performs subsequent data analysis, statistics, and picture rendering.
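The splice-and-restore step can be sketched with a simple agreed format. The format below is an assumption for illustration only, since the source does not disclose its encoding: a header of three little-endian 32-bit sizes (time, grid, pollutant) followed by the concentration values as 64-bit floats in row-major order. The function names are hypothetical.

```python
import struct

# Assumed agreed format: sizes of the time, grid, and pollutant dimensions.
HEADER = struct.Struct("<III")


def reduce_dims(data):
    """Flatten data[time][cell][pollutant] into one byte stream."""
    t, g, p = len(data), len(data[0]), len(data[0][0])
    flat = [v for step in data for cell in step for v in cell]
    # Only the values are sent; positions are implied by the agreed order.
    return HEADER.pack(t, g, p) + struct.pack(f"<{len(flat)}d", *flat)


def restore_dims(stream):
    """Rebuild the nested structure from the stream and its header."""
    t, g, p = HEADER.unpack_from(stream, 0)
    flat = iter(struct.unpack_from(f"<{t * g * p}d", stream, HEADER.size))
    return [[[next(flat) for _ in range(p)] for _ in range(g)]
            for _ in range(t)]


cube = [[[1.0, 2.0], [3.0, 4.0]],
        [[5.0, 6.0], [7.0, 8.0]]]      # 2 time steps, 2 cells, 2 pollutants
stream = reduce_dims(cube)
print(len(stream))                     # 76 bytes: 12-byte header + 8 values
print(restore_dims(stream) == cube)    # True
```

Because the index of every value is implied by its position in the stream, no per-value time, coordinate, or pollutant labels need to be transmitted, which is where the size reduction in this sketch comes from.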

This embodiment provides a way to transmit and summarize data by slicing and dimension reduction. For data slicing, when a complex CALPUFF model is calculated in a distributed manner, the sub-case calculations are randomly scattered across different computing node servers. Each node performs local summarization on its model results and, once the summarization on a node is finished, sends its data to the data processing server for final summarization. Data dimension reduction means converting the original multidimensional data structure into one-dimensional data according to a specific encoding, keeping only the final pollutant concentration value of each grid cell and removing the qualifier values of the other dimensions.

Claims

1. A system for realizing distributed high concurrency data summarization based on a dimension reduction segmentation technology is characterized in that: the system for summarizing the data comprises a data processing system and a computing node system;

After each computing node server performs local summarization, the data is subjected to dimension reduction, namely: according to the agreed format of the data stream of each dimension, the data streams of all dimensions are spliced into a one-dimensional data stream, which is then transmitted to the data processing system; when the data processing system receives the one-dimensional data stream, the data is restored into multidimensional data through the agreed data format;

The data processing system receives one-dimensional data of a plurality of computing node servers, and performs data conversion and secondary summarization in real time;

the data analysis module is used for carrying out secondary data analysis on the result of final case calculation output by the CALMET model calculation module and the CALPUFF model calculation module;

The data transfer module completes the persistent storage of various data; the data to be persistently stored comprises the result data of the case output by the CALMET model calculation module and the CALPUFF model calculation module, and the data of secondary statistical analysis;

The picture rendering module renders the data into a picture in a specified format; the data content to be rendered comprises the final calculated result data of the case output by the CALMET model calculation module and the CALPUFF model calculation module.

2. The system for implementing distributed high concurrency data summarization based on the dimension reduction segmentation technology of claim 1, wherein: the secondary data analysis comprises pollution source contribution rate, maximum ground-level concentration, and statistical analysis of wind speed and wind direction.