CN117852418B

CN117852418B - Deep flow velocity data reconstruction method in ocean based on geographic integrated machine learning

Info

Publication number: CN117852418B
Application number: CN202410257569.XA
Authority: CN
Inventors: 樊荣; 颜凤芹; 苏奋振; 贺彬; 王欣宜
Original assignee: Institute of Geographic Sciences and Natural Resources of CAS
Current assignee: Institute of Geographic Sciences and Natural Resources of CAS
Priority date: 2024-03-07
Filing date: 2024-03-07
Publication date: 2024-05-14
Anticipated expiration: 2044-03-07
Also published as: CN117852418A

Abstract

The invention relates to a method for reconstructing mid-deep ocean flow velocity data based on geographic integrated machine learning, and belongs to the technical field of ocean data prediction. To address the problem that current mid-deep ocean flow velocity data sets cannot be corrected, improved and reconstructed, the method comprises the following steps: 1) acquiring data for an analysis area and preprocessing the data; 2) rasterizing the analysis area to form regional raster data, and spatially aligning the preprocessed data with the regional raster data to generate a multi-source flow velocity grid data set; 3) training and validating machine learning base models to obtain a machine learning optimization model, and predicting the ocean flow velocity of the analysis area with that model; 4) performing integrated improvement of the ocean flow velocity prediction result based on a geographically weighted regression algorithm. The method optimizes the mid-deep ocean flow velocity data, overcomes the large deviations of traditional ocean models when simulating flow velocity, and effectively improves the agreement between simulated and actual results.

Description

Deep flow velocity data reconstruction method in ocean based on geographic integrated machine learning

Technical Field

The invention relates to the technical field of ocean data prediction, in particular to a deep ocean flow velocity data reconstruction method based on geographic integrated machine learning.

Background

The mid-deep ocean is the seawater layer between the surface layer and the deep sea. It influences global heat distribution and nutrient transport and plays a key role in the Earth's climate system. The current velocity of the mid-deep ocean not only affects the transport and distribution of physical environmental elements (such as temperature and salinity), biochemical environmental elements (such as dissolved oxygen and chlorophyll) and marine organisms, but is also an essential parameter for understanding the mechanism and physical processes of the global overturning circulation. Reliable mid-deep ocean flow velocity data therefore help improve the understanding of ocean current mechanisms and processes, and are of practical significance for coping with climate change.

Existing global mid-deep ocean flow velocity data are mainly produced by simulating the overall physical state and dynamic processes of the ocean, supplemented by the assimilation of in-situ and satellite-observed data such as temperature, salinity, sea level height, sea ice concentration, sea ice motion and sea ice thickness. Because of imperfect physical processes, inadequate parameterization schemes, incomplete initial assumptions and strong spatio-temporal differences in the observed data during simulation, the accuracy of these data still needs to be improved. They cannot accurately reproduce the spatial distribution, current system structure and flow field characteristics of local sea areas, and cannot meet current scientific and application needs.

Machine learning is an interdisciplinary field involving probability theory, statistics and other disciplines. It studies how computers can simulate or realize human learning behavior to acquire new knowledge or skills, and how they can reorganize existing knowledge structures to continuously improve their own performance. Machine learning algorithms are now widely used in ocean research and have successfully addressed a series of problems, such as improving the simulation of surface currents, reconstructing wave and tide data sets, and improving the parameterization of physical processes in ocean models. However, these prior studies have neglected the correction, improvement and reconstruction of mid-deep ocean flow velocity data sets. An efficient and accurate method for reconstructing mid-deep flow velocity data, together with higher-precision simulation of mid-deep ocean flow velocity, therefore helps capture the physical processes of ocean motion more accurately and is of great significance for further understanding and predicting ocean dynamic processes.

Disclosure of Invention

In view of the shortcomings described in the background, the invention aims to provide a method for reconstructing mid-deep ocean flow velocity data based on geographic integrated machine learning. The method effectively improves the accuracy and computational efficiency of mid-deep ocean flow velocity data, improves the effectiveness of ocean simulation, and brings the predicted ocean flow velocity closer to the actual state of the mid-deep ocean environment, so that it can serve future ocean state estimation and climate change simulation and thereby solve the problems raised in the background.

The technical aim of the invention is realized by the following technical scheme:

A first aspect of the present disclosure is a method for reconstructing mid-deep ocean flow velocity data based on geographic integrated machine learning, comprising the following steps:

S1, data preparation: acquiring seasonal-scale observed flow velocity data and ocean model simulated flow velocity data for an analysis area, and preprocessing them;

S2, spatial alignment of data: dividing the whole analysis area with a regular grid to form regional raster data, and spatially aligning the preprocessed observed flow velocity data and ocean model simulated flow velocity data with the regional raster data to generate a multi-source flow velocity grid data set;

S3, flow velocity prediction: inputting the multi-source flow velocity grid data set into machine learning base models for training and validation to obtain a machine learning optimization model, and predicting the ocean flow velocity of the analysis area with the machine learning optimization model;

S4, improvement of the prediction result: performing integrated improvement of the ocean flow velocity prediction result of step S3 based on a geographically weighted regression algorithm.
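The spatial alignment of step S2 can be illustrated with a minimal sketch. It assumes a regular latitude-longitude grid, cell-wise averaging as the resampling rule, and hypothetical helper names (`cell_index`, `grid_mean`, `align`); none of these details is fixed by the disclosure.

```python
import math
from collections import defaultdict

def cell_index(lon, lat, lon0, lat0, cell_deg):
    """Map a position to the (column, row) index of a regular grid cell."""
    return (math.floor((lon - lon0) / cell_deg),
            math.floor((lat - lat0) / cell_deg))

def grid_mean(points, lon0, lat0, cell_deg):
    """Average (u, v) of all points falling in each cell.

    points: iterable of (lon, lat, u, v). Cell averaging is an assumed
    resampling rule, not one prescribed by the source.
    """
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for lon, lat, u, v in points:
        a = acc[cell_index(lon, lat, lon0, lat0, cell_deg)]
        a[0] += u
        a[1] += v
        a[2] += 1
    return {c: (su / n, sv / n) for c, (su, sv, n) in acc.items()}

def align(obs_points, sim_points, lon0, lat0, cell_deg):
    """Pair simulated and observed velocities for cells that contain both."""
    obs = grid_mean(obs_points, lon0, lat0, cell_deg)
    sim = grid_mean(sim_points, lon0, lat0, cell_deg)
    return {c: (sim[c], obs[c]) for c in obs.keys() & sim.keys()}

# Illustrative values only (lon, lat, u, v).
obs_points = [(110.2, 10.3, 2.0, 1.0), (110.4, 10.1, 4.0, 3.0),
              (112.6, 11.2, -1.0, 0.0)]
sim_points = [(110.25, 10.25, 1.0, 1.0), (110.75, 10.75, 3.0, -1.0),
              (111.5, 10.5, 9.0, 9.0)]
aligned = align(obs_points, sim_points, lon0=110.0, lat0=10.0, cell_deg=1.0)
print(aligned)
```

Only cells that hold both an observation and a simulated value survive the pairing, which is what makes the result usable as matched training samples in step S3.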

As a further preferable scheme of the above technical scheme: in step S1, the observed flow velocity data is middle-deep flow velocity data obtained by screening based on the observation depth of the Argo buoy as a screening parameter, wherein the observation depth is 950-1050 meters, specifically:

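$$\{(u_{i,j},\ v_{i,j}) \mid 950\ \text{m} \le d_{i,j} \le 1050\ \text{m}\}$$

wherein $i$ is the number of the buoy device, $j$ is the number of the buoy observation cycle, $d_{i,j}$ is the corresponding observation depth, and $u_{i,j}$ and $v_{i,j}$ are the east-west and north-south components, respectively, of the buoy-observed flow velocity after conditional screening.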
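The depth screening can be sketched as follows. The record layout (`FloatObs`) and the inclusive treatment of the 950 m and 1050 m bounds are assumptions made for illustration.

```python
from typing import NamedTuple

class FloatObs(NamedTuple):
    """One buoy observation (hypothetical record layout)."""
    float_id: int    # buoy device number i
    cycle: int       # observation cycle number j
    depth_m: float   # observation depth in meters
    u: float         # east-west velocity component
    v: float         # north-south velocity component

def screen_by_depth(records, min_depth=950.0, max_depth=1050.0):
    """Keep records whose depth lies in [min_depth, max_depth] (inclusive, assumed)."""
    return [r for r in records if min_depth <= r.depth_m <= max_depth]

# Illustrative values only.
records = [
    FloatObs(1, 1, 1000.0, 2.1, -0.4),
    FloatObs(1, 2, 1490.0, 1.0, 0.3),
    FloatObs(2, 1, 955.0, -3.2, 1.7),
    FloatObs(2, 2, 400.0, 5.0, 2.0),
]
screened = screen_by_depth(records)
print([(r.float_id, r.cycle) for r in screened])
```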

On the basis of the above scheme, in step S1 the ocean model simulated flow velocity data are obtained by using HYCOM to assimilate and analyze the observed flow velocity data; the spatial resolution of the ocean model simulated flow velocity data is 1/12° and the temporal resolution is 1 hour.

On the basis of the above scheme, step S1 further comprises preprocessing the observed flow velocity data and the ocean model simulated flow velocity data by format conversion, data screening and geographic coordinate system conversion.

As a further preferable scheme of the above technical scheme: in step S2, the multi-source flow velocity grid data set is generated as follows:

S201: accumulating and averaging the observed flow velocity data within each season to generate year-by-year seasonal observed flow velocity data;

S202: obtaining the simulated daily average flow velocity data for each year, accumulating them season by season to calculate the flow velocity of the analysis area for each season of each year, and generating year-by-year seasonal ocean model simulated flow velocity data according to the following expression:

$$(\bar{u},\ \bar{v}) = \frac{1}{N}\sum_{i=1}^{N}(u_i,\ v_i)$$

wherein $(\bar{u}, \bar{v})$ is the year-by-year seasonal ocean model simulated flow velocity data value, $N$ is the number of days contained in the season, and $u_i$ and $v_i$ are the east-west and north-south components, respectively, of the ocean model simulated flow velocity on day $i$ of the season.
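A minimal sketch of the seasonal aggregation follows. It assumes that each daily record already carries a year and a season label and that the seasonal value is the component-wise mean over the days of the season; the function name `seasonal_means` is hypothetical.

```python
from collections import defaultdict

def seasonal_means(daily):
    """Accumulate daily (u, v) values per (year, season) and average them.

    daily: iterable of (year, season, u, v) tuples. How days are assigned
    to seasons is left to the caller (an assumption of this sketch).
    """
    acc = defaultdict(lambda: [0.0, 0.0, 0])
    for year, season, u, v in daily:
        a = acc[(year, season)]
        a[0] += u   # running sum of east-west component
        a[1] += v   # running sum of north-south component
        a[2] += 1   # number of days N in the season
    return {key: (su / n, sv / n) for key, (su, sv, n) in acc.items()}

# Illustrative values only.
daily = [
    (2010, "spring", 1.0, -1.0),
    (2010, "spring", 3.0, 1.0),
    (2010, "summer", 2.0, 2.0),
]
season_table = seasonal_means(daily)
print(season_table)
```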

As a further preferable scheme of the above technical scheme: in step S3, the process of training and verifying the machine learning model specifically includes:

S31: taking the ocean model simulated flow velocity data as the sample attributes and the observed flow velocity data as the sample prediction target, and dividing the samples into training sample data and test sample data;

S32: inputting the training sample data into each of the different machine learning base models for training, and validating the trained machine learning base models with the test sample data;

S33: obtaining the machine learning optimization model after the validation is passed.

On the basis of the above scheme, in step S32 the machine learning base models are validated as follows:

taking the observed flow velocity as the true value and the simulated flow velocity as the value to be validated, and validating with two quantitative indices, the correlation coefficient R and the root mean square error RMSE, whose expressions are:

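$$R = \frac{\sum_{t=1}^{n}(O_t - \bar{O})(S_t - \bar{S})}{\sqrt{\sum_{t=1}^{n}(O_t - \bar{O})^2}\,\sqrt{\sum_{t=1}^{n}(S_t - \bar{S})^2}}$$

$$RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(S_t - O_t)^2}$$

wherein $O_t$ is the $t$-th buoy flow velocity observation, $S_t$ is the $t$-th ocean model simulated flow velocity value, $\bar{O}$ and $\bar{S}$ are the corresponding means, and $n$ is the number of flow velocity values common to the observed and simulated flow velocities.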
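The two validation indices can be computed as in the following sketch, which uses the standard definitions of the Pearson correlation coefficient and the root mean square error; the function names are illustrative.

```python
import math

def pearson_r(obs, sim):
    """Pearson correlation coefficient between observed and simulated values."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_s = sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov / math.sqrt(var_o * var_s)

def rmse(obs, sim):
    """Root mean square error of the simulated values against the observations."""
    return math.sqrt(sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs))

# Illustrative values only: a simulation with a constant 0.5 offset.
obs = [1.0, 2.0, 3.0, 4.0]
sim = [1.5, 2.5, 3.5, 4.5]
print(pearson_r(obs, sim), rmse(obs, sim))
```

The example shows why both indices are used together: a constant offset leaves R at 1 while the RMSE still reports the bias.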

As a further preferable scheme of the above technical scheme: in step S3, when the machine learning optimization model is used to predict the ocean flow velocity, the input data are normalized by min-max normalization, and the framework of the machine learning optimization model is as follows:

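$$(u_{ML},\ v_{ML}) = F\big((u_{obs},\ v_{obs}),\ (u_{sim},\ v_{sim})\big)$$

wherein $F$ denotes the machine learning optimization model; $u_{ML}$ and $v_{ML}$ are the east-west and north-south components, respectively, of the flow velocity output by the machine learning optimization model; $u_{obs}$ and $v_{obs}$ are the east-west and north-south components, respectively, of the buoy-observed flow velocity after data normalization; and $u_{sim}$ and $v_{sim}$ are the east-west and north-south components, respectively, of the ocean model simulated flow velocity after conditional screening.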
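The min-max normalization applied to the model input can be sketched as below. Fitting the minimum and maximum on one data set and reusing them elsewhere is an assumed usage pattern, and the function names are hypothetical.

```python
def min_max_fit(values):
    """Return the (min, max) pair used for scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return lo, hi

def min_max_apply(values, lo, hi):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    return [(x - lo) / (hi - lo) for x in values]

def min_max_invert(scaled, lo, hi):
    """Map normalized values back to physical units."""
    return [s * (hi - lo) + lo for s in scaled]

# Illustrative values only.
u_values = [-4.0, 0.0, 6.0]
lo, hi = min_max_fit(u_values)
u_scaled = min_max_apply(u_values, lo, hi)
u_restored = min_max_invert(u_scaled, lo, hi)
print(u_scaled)
```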

As a further preferable scheme of the above technical scheme: in step S4, the integrated improvement of the ocean flow velocity prediction result improves the integrated prediction capability for flow velocity by taking into account the spatial correlation between geographic location and the flow velocities predicted by the multiple machine learning models, specifically:

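$$Y_i = \beta_0(\lambda_i, \varphi_i) + \beta_1(\lambda_i, \varphi_i)\,M_{1i} + \beta_2(\lambda_i, \varphi_i)\,M_{2i} + \beta_3(\lambda_i, \varphi_i)\,M_{3i}$$

wherein $(\lambda_i, \varphi_i)$ represents the coordinates of point $i$, $Y_i$ is the integrated predicted value of the flow velocity, $M_{1i}$, $M_{2i}$ and $M_{3i}$ are the predicted values of the first, second and third machine learning base models, respectively, $\beta_0$ represents the intercept, and $\beta_1$, $\beta_2$ and $\beta_3$ represent the estimated regression coefficients of the first, second and third machine learning base model predictions, respectively.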
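A geographically weighted regression over three base-model predictions can be sketched as a locally weighted least-squares fit. The Gaussian kernel, the fixed bandwidth, planar distances and all function names are assumptions of this sketch; the disclosure does not specify them.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def gwr_coefficients(target_xy, coords, base_preds, observed, bandwidth):
    """Local coefficients [b0, b1, b2, b3] at target_xy.

    Calibration points closer to the target receive larger Gaussian weights
    (assumed kernel); the weighted normal equations are then solved.
    """
    weights = [math.exp(-0.5 * (math.dist(target_xy, c) / bandwidth) ** 2)
               for c in coords]
    rows = [[1.0] + list(p) for p in base_preds]  # intercept + 3 predictions
    k = len(rows[0])
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for w, r, y in zip(weights, rows, observed))
            for i in range(k)]
    return solve(xtwx, xtwy)

def gwr_predict(beta, preds):
    """Integrated prediction: intercept plus weighted base-model predictions."""
    return beta[0] + sum(b * p for b, p in zip(beta[1:], preds))

# Illustrative data: observations follow an exact linear blend of the base models.
coords = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
base_preds = [(1, 2, 0), (2, 1, 1), (0, 1, 3), (3, 0, 1), (1, 1, 1), (2, 3, 5)]
observed = [1.0 + 0.5 * a + 0.3 * b + 0.2 * c for a, b, c in base_preds]
beta = gwr_coefficients((1.0, 1.0), coords, base_preds, observed, bandwidth=1.0)
integrated = gwr_predict(beta, (2.0, 2.0, 2.0))
print(beta, integrated)
```

Because the coefficients are re-estimated for every target location, the blend of the three base models is allowed to vary across the analysis area, which is the property the integrated improvement relies on.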

A second aspect of the present disclosure is an electronic device comprising a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory to implement the steps of the method for reconstructing mid-deep ocean flow velocity data based on geographic integrated machine learning according to any of the above aspects.

A third aspect of the present disclosure is a computer-readable storage medium storing computer instructions for causing a computer to perform the steps of the method for reconstructing mid-deep ocean flow velocity data based on geographic integrated machine learning according to any of the above aspects.

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

1. The method screens the observed flow velocity data using observation depth as the parameter and generates year-by-year seasonal data sets by accumulating and averaging within each season, thereby effectively integrating multiple data sets. Drawing on the strong computing capacity and good generalization capacity of machine learning, it converges quickly and predicts with high efficiency, achieves the correction, reconstruction and improvement of mid-deep ocean flow velocity data sets, can accurately reveal the dynamic variation of ocean flow velocity, and provides more reliable mid-deep ocean flow velocity data sets for fields such as ocean science and meteorological prediction.

2. The invention fuses Argo buoy observation data with ocean model simulation data to obtain the ocean flow velocity of different spatial units at the target depth, which improves the inversion accuracy of mid-deep ocean flow velocity and largely overcomes shortcomings of existing ocean simulated flow velocity data, such as underestimated current velocities and complete dependence on physical principles. In addition, because ocean model simulated flow velocity values and buoy-observed flow velocity values generally differ considerably, the ocean model simulated flow velocity data, the buoy-observed flow velocity data and the analysis area grid data are matched into data pairs through operations such as coordinate system calibration and grid resampling, for use by the subsequent reconstruction algorithm.

3. Based on a geographically weighted regression algorithm, the method exploits the spatial correlation between geographic location and the flow velocities predicted by multiple machine learning models to improve their spatially integrated prediction capability. It can accurately describe and predict the spatial distribution and variability of the target variable, improves the reliability and accuracy of flow velocity prediction, and reduces its instability. In addition, through the integrated prediction of the machine learning models, the nonlinear quantitative relationship established between simulated and observed flow velocity can be extended to areas with insufficient Argo buoy observations, filling in flow velocity data for sparsely observed sea areas and providing more comprehensive ocean flow field information for ocean management systems.

Drawings

In order to more clearly illustrate the technical solution of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described.

FIG. 1 shows a schematic flow chart diagram of one embodiment of a geo-integrated machine learning based in-sea deep flow data reconstruction method of the present invention;

FIG. 2 shows a schematic flow chart of a specific method of one embodiment of the invention;

FIG. 3 shows an experimental index validation graph of one embodiment of the invention, wherein: the upper graph represents the R value and the lower graph represents the RMSE value.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present application. Numerous specific details are set forth in the following description to provide a thorough understanding of the present application, but the present application may also be practiced in ways other than those described herein, and its scope is therefore not limited to the specific embodiments disclosed below.

Referring to figs. 1-3, the invention provides a method for reconstructing mid-deep ocean flow velocity data based on geographic integrated machine learning, which comprises the following steps:

S2, spatial alignment of data: dividing the whole analysis area with a regular grid (that is, rasterizing the analysis area) to form regional raster data, and spatially aligning the preprocessed observed flow velocity data and ocean model simulated flow velocity data with the regional raster data to generate a multi-source flow velocity grid data set;

More specifically, the overall implementation framework of this embodiment is shown in fig. 2 and is implemented in Python; in step S1, the seasonal-scale ocean flow velocity observation data and ocean model simulated flow velocity data of the analysis area are acquired as follows:

The observed flow velocity data come from the global ocean observing system (program) Argo (Array for Real-time Geostrophic Oceanography). By deploying buoys in the ocean to measure its physical, chemical and biological properties in real time, the system provides parameters such as ocean temperature, salinity, pressure, flow velocity and depth, and supplies data support for near-real-time observation of global ocean environmental change, prediction of climate change and weather phenomena, ocean resource management, ocean ecological protection and so on.

Preferably, the mid-deep observed flow velocity data used in this embodiment are obtained by inversion from the trajectory data (with a cycle of about 9 days) of Argo buoys near the 1000 m layer below the ocean surface, and the buoy-observed flow velocity data are screened using a preset device observation depth of 950-1050 meters as the screening parameter, specifically:

From the buoy-observed flow velocity data at the selected depth, duplicate values, abnormal values, outliers and invalid values are removed and missing values are filled. The original drift trajectory information of the Argo buoys is then analyzed, and the speed and direction of the current are calculated from the displacement and time interval of each buoy within a set temporal and spatial range, giving the required observed flow velocity data.
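Deriving velocity from buoy displacement and time interval can be sketched as follows. The local flat-Earth approximation, the mean Earth radius, the output in cm/s and the function name `drift_velocity` are assumptions of this sketch, and longitude wrap-around at the date line is not handled.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (assumed constant)

def drift_velocity(lon1, lat1, lon2, lat2, dt_seconds):
    """Estimate (u, v, speed, direction) from two buoy positions.

    u, v and speed are in cm/s; direction is the heading of the drift in
    degrees clockwise from north.
    """
    if dt_seconds <= 0:
        raise ValueError("time interval must be positive")
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    dx = EARTH_RADIUS_M * math.cos(mean_lat) * math.radians(lon2 - lon1)
    dy = EARTH_RADIUS_M * math.radians(lat2 - lat1)
    u = 100.0 * dx / dt_seconds   # east-west component
    v = 100.0 * dy / dt_seconds   # north-south component
    speed = math.hypot(u, v)
    direction = math.degrees(math.atan2(u, v)) % 360.0
    return u, v, speed, direction

# Illustrative case: a buoy drifting 0.1 degree northward over 9 days.
u, v, speed, direction = drift_velocity(120.0, 0.0, 120.0, 0.1, 9 * 86400)
print(round(v, 2), direction)
```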

The simulated flow velocity data are HYCOM reanalysis data (Hybrid Coordinate Ocean Model reanalysis data). HYCOM is a numerical model developed from the isopycnal-coordinate model MICOM. It overcomes the limitation of the base model's purely isopycnal vertical coordinate by combining it with sigma and z coordinates into a more practical hybrid vertical coordinate, which makes it better suited to sea areas with complex topography. The model uses a three-dimensional variational data assimilation scheme (3DVAR) to assimilate and analyze the buoy-observed flow velocity data, producing a reanalysis that improves the consistency between the simulated ocean fields and the actual ocean environment. The HYCOM data have a spatial resolution of 1/12° and a temporal resolution of 1 hour.

Further, Python and ArcGIS are used to perform format conversion and data screening on the ocean model simulated flow velocity data and the buoy-observed flow velocity data: the data are upscaled to the same lower resolution by spatial aggregation and unified into the spatial vector shape format, after which geographic coordinate system conversion is applied to the preprocessed ocean model simulated and buoy-observed flow velocity data.

In step S2, the analysis area is first rasterized to form regional raster data. The preprocessed buoy-observed flow velocity data and ocean model simulated flow velocity data are then resampled to the geographic grid size of the regional raster data, and a spatial overlay operation based on the regional raster data spatially aligns both data sets with it, generating the multi-source flow velocity grid data set. Specifically:

1) accumulating and averaging the observed flow velocity data within each season to generate year-by-year seasonal observed flow velocity data;

2) processing the ocean model simulated flow velocity data into seasonal values as follows:

obtaining the daily mid-deep average flow velocity data of the ocean model simulation for each year, accumulating them season by season to calculate the flow velocity data values of the analysis area for each season of each year, and generating year-by-year seasonal ocean model simulated flow velocity data according to the following expression:

More specifically, in step S3, within the matched (that is, spatially aligned) data set of year-by-year seasonal observed flow velocity data and ocean model simulated flow velocity data, the ocean model simulated flow velocity data are used as the sample attributes and the buoy-observed flow velocity data as the sample prediction target. 80% of the data are used as training sample data and the remaining 20% as test sample data. The training sample data are input into each of the different machine learning base models for training, the trained base models are validated with the test sample data, and the machine learning optimization model is obtained after validation is passed.
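The 80%/20% division of the matched samples can be sketched as below. A seeded random shuffle is assumed here; the disclosure does not state whether the split is random, spatial or temporal, and the function name is hypothetical.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Split samples into training and test lists using a seeded random shuffle."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(samples) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test

# Each sample pairs simulated (u, v) attributes with observed (u, v) targets.
# Values are illustrative only.
samples = [((float(i), float(-i)), (float(i) + 0.5, 0.0)) for i in range(10)]
train, test = train_test_split(samples)
print(len(train), len(test))
```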

In this embodiment, three machine learning base algorithms are selected according to their fitting capacity, generalization capacity and parameter self-adaptation capacity to predict the ocean flow velocity of the whole analysis area: the deep residual network, XGBoost and random forest algorithms. In a traditional deep network, accuracy degrades as the number of layers increases; by adding residual connections, the deep residual network algorithm can quickly and effectively propagate error information to every layer of the network, reducing the error accumulation found in traditional deep networks. The deep residual network algorithm in this embodiment is optimized with the following loss function:

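$$L(W, b) = \lVert y - f_{W,b}(x) \rVert_2^2 + \Omega(W, b)$$

wherein $y$ is the predicted flow velocity (the target variable), $x$ is the simulated flow velocity (the covariates), $f_{W,b}$ is the mapping function with parameters $W$ and $b$, and $\Omega$ is the regularization term, which uses the L1 and L2 norms as prior regularizers in the manner of a linear penalized regression model.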

Unlike the Bagging method, which trains its models independently, each model in a Boosting algorithm computes residuals on the basis of the preceding model; the overall bias is gradually reduced by training the models iteratively and increasing the weights of the samples that the weak model of the previous round misclassified. The XGBoost algorithm builds on the classical Boosting algorithm GBDT with several optimizations, including adding a regularization term to the loss function to restrain excessive model complexity, using a weighted quantile sketch to select candidate split points for features, and using a sparsity-aware algorithm to handle missing features. XGBoost is widely used in many fields because of its efficiency and flexibility. It obtains the final ensemble prediction from additive functions, that is, by summing the scores in the corresponding leaves of each tree, and its prediction function is:

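$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$

where $K$ is the number of additive functions, each $f_k$ represents an independent tree, $\mathcal{F} = \{f(x) = w_{q(x)}\}$ is the space of regression trees, and $q$ represents the structure of each tree, mapping a sample to a leaf with weight $w$. The regularized objective function can be expressed as:

$$\mathcal{L} = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k)$$

wherein $l$ is a differentiable loss function (measuring the difference between the prediction and the target) and $\Omega$ is the regularization term. On this basis, the objective is optimized with a second-order Taylor approximation to improve optimization efficiency:

$$\mathcal{L}^{(t)} \simeq \sum_{i=1}^{n}\left[l\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i)\right] + \Omega(f_t)$$

wherein $f_t$ is the tree added at the $t$-th iteration, and $g_i$ and $h_i$ are the first- and second-order gradient statistics of the loss function.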
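The role of the first- and second-order gradient statistics can be illustrated with a small sketch. It assumes a squared-error loss of the form ½(ŷ − y)² and an L2 leaf penalty with strength `reg_lambda`; the closed-form leaf weight −G/(H + λ) is taken from the published XGBoost formulation rather than from this disclosure.

```python
def grad_hess_squared_error(y_true, y_pred):
    """Gradient statistics of 0.5 * (pred - true)^2 with respect to the prediction."""
    g = [p - t for t, p in zip(y_true, y_pred)]   # first-order statistics
    h = [1.0] * len(y_true)                       # second-order statistics
    return g, h

def leaf_weight(g, h, reg_lambda=1.0):
    """Optimal weight of a leaf holding these samples under an L2 penalty."""
    return -sum(g) / (sum(h) + reg_lambda)

# Illustrative values: three samples in one leaf, current prediction 0.
y_true = [3.0, 5.0, 4.0]
y_pred = [0.0, 0.0, 0.0]
g, h = grad_hess_squared_error(y_true, y_pred)
w = leaf_weight(g, h, reg_lambda=1.0)
# Additive update: the new tree's leaf score is added to the previous prediction.
y_next = [p + w for p in y_pred]
print(g, w, y_next)
```

With λ = 1 the leaf weight is 3.0 rather than the plain residual mean of 4.0, which shows how the regularization term shrinks each additive step.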

The random forest algorithm is an ensemble learning algorithm based on Bagging. It uses decision trees as base models and uses bootstrap sampling to draw sample subsets from the overall training set; at each node of each decision tree, only a randomly selected subset of features is considered. The final prediction of the algorithm can be expressed as:

$$\hat{f}(x) = \frac{1}{K}\sum_{k=1}^{K} T_k(x)$$

where $x$ is the input feature, $T_k$ is the $k$-th decision tree, $K$ is the number of trees, and the final ensemble prediction is the average of the base learner trees.
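The bootstrap sampling and averaging steps can be sketched as follows. The "trees" here are stand-in callables rather than fitted decision trees, and the function names are hypothetical; the sketch shows only how samples are drawn and how the ensemble prediction is formed.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def forest_predict(trees, x):
    """Average the predictions of all base learners for input x."""
    return sum(tree(x) for tree in trees) / len(trees)

# Stand-in base learners (not fitted trees): each maps an input to a prediction.
trees = [
    lambda x: 2.0,
    lambda x: 4.0,
    lambda x: 6.0 if x > 0 else 0.0,
]
data = [(0.1, 1.0), (0.2, 1.5), (0.3, 2.5), (0.4, 3.0)]
sample = bootstrap_sample(data, random.Random(0))
print(forest_predict(trees, 1.0), forest_predict(trees, -1.0))
```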

Further, when the machine learning basic model is verified, the observed flow rate is taken as a true value, the simulated flow rate is taken as a verification value, and the quantitative index correlation coefficient R and the root mean square error RMSE are utilized for verification, wherein the expressions are respectively as follows:

The machine learning optimization model is obtained through the above process. When it is used to predict the ocean flow velocity, the input data are normalized by min-max normalization, and the framework of the machine learning optimization model is as follows:

In step S4, the integrated prediction capability of the marine flow rate prediction result is improved by considering the spatial correlation between the geographic location and the predicted flow rates of the plurality of machine learning models, specifically as follows:

The method provided by the invention further optimizes the deep flow velocity data result in the ocean, overcomes the defect of flow velocity deviation in ocean model simulation, and enables the predicted result to be more consistent with the actual result.

As the experimental indices show (see the correlation coefficient R and the root mean square error RMSE in fig. 3), the R values of the east-west and north-south components of the mid-layer ocean flow velocity increase by 0.55-0.75: R rises from the original 0.38 (OFES flow velocity, east-west component) and 0.18 (product flow velocity, north-south component) to 0.93 (corrected flow velocity, east-west and north-south components). The RMSE decreases by 3.34-3.83 cm/s: from the original 6.51 (ECCO product flow velocity, east-west component) and 5.82 (ECCO product flow velocity, north-south component) to 2.68 (corrected flow velocity, east-west component) and 2.48 (corrected flow velocity, north-south component). Both indices improve substantially, indicating that the prediction results obtained by the method agree better with the actual results.
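The reported improvement ranges follow directly from the before and after values quoted above, as the following arithmetic check shows. The dictionary keys are illustrative labels only.

```python
# Values quoted in the experimental results above.
r_before = {"east_west": 0.38, "north_south": 0.18}
r_after = {"east_west": 0.93, "north_south": 0.93}
rmse_before = {"east_west": 6.51, "north_south": 5.82}   # cm/s
rmse_after = {"east_west": 2.68, "north_south": 2.48}    # cm/s

# Increase in correlation coefficient and decrease in RMSE per component.
r_gain = {k: round(r_after[k] - r_before[k], 2) for k in r_before}
rmse_drop = {k: round(rmse_before[k] - rmse_after[k], 2) for k in rmse_before}

print(r_gain)
print(rmse_drop)
```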

It should be noted that the steps in the present application may be reordered, combined or deleted according to actual needs. Although the present application has been disclosed in detail with reference to the accompanying drawings, these descriptions are merely exemplary and are not intended to limit its application. The scope of the application is defined by the appended claims and may include various modifications, alterations and equivalents that do not depart from the scope and spirit of the application.

Claims

1. The method for reconstructing the deep flow velocity data in the ocean based on the geographic integrated machine learning is characterized by comprising the following steps of:

S4, improvement of the prediction result: performing integrated improvement of the ocean flow velocity prediction result of step S3 based on a geographically weighted regression algorithm;

in step S2, the multi-source flow velocity grid data set is generated as follows:

$$(\bar{u},\ \bar{v}) = \frac{1}{N}\sum_{i=1}^{N}(u_i,\ v_i);$$

wherein $(\bar{u}, \bar{v})$ is the year-by-year seasonal ocean model simulated flow velocity data value, $N$ is the number of days contained in the season, and $u_i$ and $v_i$ are the east-west and north-south components, respectively, of the ocean model simulated flow velocity on day $i$ of the season;

S203: spatially aligning the year-by-year seasonal observed flow velocity data and the year-by-year seasonal ocean model simulated flow velocity data with the regional raster data to generate the multi-source flow velocity grid data set;

in step S4, the integrated improvement of the ocean flow velocity prediction result improves the integrated prediction capability for flow velocity by taking into account the spatial correlation between geographic location and the flow velocities predicted by the multiple machine learning models, specifically:

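$$Y_i = \beta_0(\lambda_i, \varphi_i) + \beta_1(\lambda_i, \varphi_i)\,M_{1i} + \beta_2(\lambda_i, \varphi_i)\,M_{2i} + \beta_3(\lambda_i, \varphi_i)\,M_{3i};$$

wherein $(\lambda_i, \varphi_i)$ represents the coordinates of point $i$, $Y_i$ is the integrated predicted value of the flow velocity, $M_{1i}$, $M_{2i}$ and $M_{3i}$ are the predicted values of the first, second and third machine learning base models, respectively, $\beta_0$ represents the intercept, and $\beta_1$, $\beta_2$ and $\beta_3$ represent the estimated regression coefficients of the first, second and third machine learning base model predictions, respectively.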

2. The method for reconstructing ocean deep flow velocity data based on geographic integrated machine learning according to claim 1, wherein in step S1, the observed flow velocity data is the deep flow velocity data obtained by screening based on the observation depth of the Argo buoy as the screening parameter, wherein the observation depth is 950-1050 meters, specifically:

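$$\{(u_{i,j},\ v_{i,j}) \mid 950\ \text{m} \le d_{i,j} \le 1050\ \text{m}\};$$

wherein $i$ is the number of the buoy device, $j$ is the number of the buoy observation cycle, $d_{i,j}$ is the corresponding observation depth, and $u_{i,j}$ and $v_{i,j}$ are the east-west and north-south components, respectively, of the buoy-observed flow velocity after conditional screening.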

3. The method for reconstructing deep flow velocity data in ocean based on geographic integrated machine learning according to claim 2, wherein in step S1, the ocean mode simulation flow velocity data is data obtained by assimilating and analyzing observation flow velocity data by using HYCOM, and the spatial resolution of the ocean mode simulation flow velocity data is 1/12 ° and the temporal resolution is 1 hour.

4. The method for reconstructing deep flow data in ocean based on geographic integrated machine learning according to claim 1, wherein in step S3, the process of training and verifying the machine learning model is specifically:

5. The method for reconstructing deep flow data in ocean based on geographic integrated machine learning according to claim 4, wherein in step S32, when verifying the machine learning base model, the method comprises the following steps:

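$$R = \frac{\sum_{t=1}^{n}(O_t - \bar{O})(S_t - \bar{S})}{\sqrt{\sum_{t=1}^{n}(O_t - \bar{O})^2}\,\sqrt{\sum_{t=1}^{n}(S_t - \bar{S})^2}}, \qquad RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(S_t - O_t)^2};$$

wherein $O_t$ is the $t$-th buoy flow velocity observation, $S_t$ is the $t$-th ocean model simulated flow velocity value, $\bar{O}$ and $\bar{S}$ are the corresponding means, and $n$ is the number of flow velocity values common to the observed and simulated flow velocities.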

6. The method for reconstructing deep flow data in ocean based on geographic integrated machine learning according to claim 5, wherein in step S3, when predicting the ocean flow by using a machine learning optimization model, the data input is normalized by a maximum and minimum value, and the machine learning optimization model is:

$$(u_{ML},\ v_{ML}) = F\big((u_{obs},\ v_{obs}),\ (u_{sim},\ v_{sim})\big);$$

wherein $F$ denotes the machine learning optimization model; $u_{ML}$ and $v_{ML}$ are the east-west and north-south components, respectively, of the flow velocity output by the machine learning optimization model; $u_{obs}$ and $v_{obs}$ are the east-west and north-south components, respectively, of the buoy-observed flow velocity after data normalization; and $u_{sim}$ and $v_{sim}$ are the east-west and north-south components, respectively, of the ocean model simulated flow velocity after conditional screening.

7. An electronic device comprising a processor and a memory, wherein the memory has stored thereon computer instructions, the processor being configured to execute the computer instructions stored thereon to implement the steps of the geo-integrated machine learning based in-sea deep flow data reconstruction method as recited in any of claims 1-6.

8. A computer readable storage medium storing computer instructions for causing a computer to perform the steps of the method for reconstructing deep flow data in the ocean based on geointegrated machine learning of any one of claims 1-6.