CN113111097A

CN113111097A - Method for realizing high-speed query of ocean data by using distributed database technology

Info

Publication number: CN113111097A
Application number: CN202110516187.0A
Authority: CN
Inventors: 韦广昊; 宋晓; 韩璐遥; 梁建峰; 刘志杰; 韩春花; 李维禄; 陈斐
Original assignee: NATIONAL MARINE DATA AND INFORMATION SERVICE
Current assignee: NATIONAL MARINE DATA AND INFORMATION SERVICE
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-07-13

Abstract

The invention provides a method for high-speed query of ocean data using distributed database technology, comprising the following steps: S1, constructing a marine-specific query algorithm model prototyped on MLlib according to In-database Analysis technology, and associating the query algorithm model with query service requirements to obtain a service model; S2, on receiving a query request, associating the request with the corresponding service model, which communicates with the query algorithm model through asynchronous messages to complete the query and obtain a query result; and S3, splitting the query result set and distributing the output as small-unit data results. The method tightly integrates marine professional algorithm models with distributed database technology and achieves second-level response for model result sets.

Description

Method for realizing high-speed query of ocean data by using distributed database technology

Technical Field

The invention belongs to the technical field of database query, and particularly relates to a method for realizing high-speed query of ocean data by using a distributed database technology.

Background

The marine environment is widely distributed, is shaped by complex influencing factors, and changes in ways that cannot be controlled. As science and technology have advanced, sensing equipment has diversified, resource systems have grown complex, and the range of disciplines covered has become very large. Comprehensive use of marine environment data, including life-cycle tracing, data value enhancement, difference analysis, and accurate query, therefore requires sensible use of modern information technologies such as cloud computing, virtualization, big data, and intelligent mining analysis. The aim is to extract valuable data from massive, multi-source, complex, multi-type marine environment data, to build a data-information-knowledge-value chain for efficient integrated analysis and use of marine environment information resources, and to markedly improve the technical, methodological, and platform management capabilities for those resources.

At present, distributed database systems improve distributed concurrent query performance mainly through multi-node deployment and cluster scale. When applied to the ocean field, such queries show degraded performance under complex computing workloads, for two reasons: 1. Ocean data queries involve integrated scheduling of multi-disciplinary, multi-type data, so querying on top of a marine comprehensive library carries a large volume of specialized, complex computation. 2. This computation load heavily affects the responsiveness of the database, and the efficiency of large amounts of interactive computation cannot meet the query-speed requirements of complex application scenarios.

Disclosure of Invention

In view of this, the present invention is directed to a method for implementing high-speed query of marine data by using a distributed database technology, so as to improve query efficiency.

In order to achieve the purpose, the technical scheme of the invention is realized as follows:

The method mainly aims to improve the efficiency of ocean service queries. To that end, it integrates the professional algorithm models used in ocean data query scenarios with distributed database technology, thereby achieving efficient queries.

The method for realizing high-speed query of ocean data by utilizing the distributed database technology comprises the following steps:

S1, constructing a marine-specific query algorithm model prototyped on MLlib according to In-database Analysis technology, and associating the query algorithm model with query service requirements to obtain a service model;

S2, on receiving a query request, associating the request with the corresponding service model, which communicates with the query algorithm model through asynchronous messages to complete the query and obtain a query result;

S3, splitting the query result set and distributing the output as small-unit data results.

Further, in step S1, the business model directly associates the marine comprehensive library with the query algorithm model according to the business logic when it is constructed, so as to form a mapping relationship.

Further, in step S1, the ocean data query requirement may be a real-time query or a timed query, and the results of timed queries are stored in the database. When a service query request is received: if it is a timed query, the stored result is pushed directly from the database; if it is a real-time query, the query is executed to obtain the latest result, the temporary table is split directly for result feedback, and the result is then stored in the database for later use.

Further, in step S2, after a query request message is issued, the query algorithm model responds to the received message; meanwhile, message communication does not wait for that response but issues other messages concurrently.

Further, in step S3, numerical results are split by interval according to interval rules, and results that differ by region are split according to region rules.

Compared with the prior art, the method has the following advantages:

The method tightly integrates marine professional algorithm models with distributed database technology and achieves second-level response for model result sets.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:

FIG. 1 is a block diagram of an overall architecture of a method for implementing high-speed query of marine data using distributed database technology according to an embodiment of the present invention;

FIG. 2 is a flowchart of a query algorithm model process according to an embodiment of the present invention;

FIG. 3 is a flowchart of ocean data query according to an embodiment of the present invention.

Detailed Description

It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.

Multi-dimensional query of marine environment data is a main scenario in marine business applications. Low query response efficiency stems mainly from the complexity and specialized nature of the data: producing a final query result requires data interaction with a professional algorithm system, and this inter-system interaction consumes substantial resources, so the best achievable performance cannot be reached.

The invention tightly integrates marine professional algorithm models with distributed database technology and achieves second-level response for model result sets.

By constructing a marine query algorithm model module on top of distributed data processing technology, the invention draws on the subject resources accumulated in the marine environment comprehensive library platform across multiple dimensions, and integrates big-data mining methods with In-Database machine learning algorithms to build up big-data query capability. This maps marine business scenario requirements efficiently to data result sets and provides efficient responses in query scenarios.

Step 1, constructing a marine-specific query algorithm model prototyped on MLlib according to In-database Analysis technology, and defining the model as HYMLlib; associating the query algorithm model with query service requirements to obtain a service model;

Step 2, on receiving a query request, associating the request with the corresponding service model, which communicates with the query algorithm model through asynchronous messages to complete the query and obtain a query result;

After a query request message is issued, the query algorithm model responds to the received message. Message communication need not wait for that response: the 2nd, 3rd, ..., nth messages are issued concurrently. This avoids the heavy resource use of a send, acknowledge, and retransmit round-trip pattern and improves communication efficiency.

When multiple query messages are issued, each is sent asynchronously to the service model. The service model responds to each according to its query requirement, at whatever speed its computation and splitting allow. No channel is held while waiting for a response, so the communication efficiency of other messages is not reduced.
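
The asynchronous issuing described above can be sketched with Python's asyncio. This is only an illustration under stated assumptions: `query_model`, `publish_all`, the message names, and the delays are hypothetical stand-ins and do not come from the patent, which names no messaging library.

```python
import asyncio


async def query_model(message: str, delay: float) -> str:
    """Hypothetical stand-in for the query algorithm model handling one message."""
    await asyncio.sleep(delay)  # simulates differing computation and splitting times
    return f"result:{message}"


async def publish_all(messages: dict[str, float]) -> list[str]:
    # Issue every message immediately; no message waits for an earlier response.
    tasks = [asyncio.create_task(query_model(m, d)) for m, d in messages.items()]
    # gather returns results in submission order, whatever the completion order.
    return await asyncio.gather(*tasks)


def run_demo() -> list[str]:
    # Illustrative messages with made-up processing delays (seconds).
    return asyncio.run(publish_all({"msg1": 0.03, "msg2": 0.01, "msg3": 0.02}))
```

Here the second and third messages are sent while the first is still being processed, so the total wait is close to the slowest single response rather than the sum of all three.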

Step 3, splitting the query result set and distributing the output as small-unit data results.

For example, the result set returned by an association-influence query may be split by range of association value, such as the four ranges >1, 0.7-1, 0.45-0.6, and <0.45.
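
Interval-based splitting of this kind can be sketched as follows. The boundaries, labels, field name `association`, and the handling of values that fall exactly on a boundary are assumptions made for illustration; in particular the sketch uses four contiguous ranges rather than the exact ranges listed above.

```python
from bisect import bisect_right
from collections import defaultdict

# Assumed, illustrative boundaries giving four contiguous ranges:
# < 0.45, [0.45, 0.7), [0.7, 1.0), >= 1.0
BOUNDARIES = [0.45, 0.7, 1.0]
LABELS = ["lt_0.45", "0.45_to_0.7", "0.7_to_1", "ge_1"]


def split_by_interval(rows, key="association"):
    """Group result rows into small per-interval tables."""
    tables = defaultdict(list)
    for row in rows:
        # bisect_right gives the index of the interval containing the value.
        idx = bisect_right(BOUNDARIES, row[key])
        tables[LABELS[idx]].append(row)
    return dict(tables)
```

Each returned list plays the role of one small result table that can be distributed and served independently.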

Specifically, as shown in fig. 2, step 1 includes the following steps:

1. According to In-database Analysis technology, construct a marine-specific query algorithm model prototyped on MLlib, defined as HYMLlib;

Integrate the corresponding machine learning algorithms into the model according to the specific marine query service requirements, to obtain the query algorithm model;

For example, by applying the 'Apriori frequent item set-association rule' algorithm within the query algorithm model, a 'subject association degree influence analysis model' is trained according to marine business requirements, forming an association analysis business model that supports output of data sets for subject-association queries.
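
To make the Apriori step concrete, the following is a minimal pure-Python sketch of frequent item set mining and rule confidence. It is not the HYMLlib implementation, which the patent does not disclose; the function names and the idea of treating each transaction as a set of subjects queried together are assumptions for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in sets) / n

    # Level 1 candidates: every single item.
    current = {frozenset([i]) for t in sets for i in t}
    frequent = {}
    k = 1
    while current:
        level = {c: support(c) for c in current}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of the association rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]
```

The confidence of a rule such as "current implies temperature" is one possible reading of the "association degree" between subjects; the patent does not specify which measure is used.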

The marine service query requirement may be a real-time query or a timed query, and the results of timed queries are stored in the database. For example, algorithms such as Logistic regression and Apriori frequent item set-association rule are each associated according to the needs of the business analysis model; the data of the comprehensive service library is bound to the algorithms and used for training to obtain the query algorithm model; and the query result sets are stored in the comprehensive service library, forming an independent analysis-algorithm result library.

When a service query arrives, a timed query is answered by pushing the stored result directly from the database. A real-time query is computed through communication between the database and the algorithm to form the latest result; the temporary table is split directly for result feedback, and the result is then stored for later use.
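
The two query paths can be sketched as a small dispatcher. `ResultStore`, `handle_query`, and the mode names `"timed"` and `"realtime"` are hypothetical; an in-memory dictionary stands in for the analysis-algorithm result library.

```python
from typing import Callable


class ResultStore:
    """Hypothetical stand-in for the analysis-algorithm result library."""

    def __init__(self):
        self._tables = {}

    def get(self, name):
        return self._tables.get(name)

    def put(self, name, rows):
        self._tables[name] = rows


def handle_query(name: str, mode: str, store: ResultStore,
                 compute: Callable[[], list]) -> list:
    if mode == "timed":
        # Timed query: push the precomputed result straight from the store.
        rows = store.get(name)
        if rows is None:
            raise KeyError(f"no precomputed result for {name!r}")
        return rows
    if mode == "realtime":
        # Real-time query: compute the latest result, answer, then archive it.
        rows = compute()
        store.put(name, rows)
        return rows
    raise ValueError(f"unknown mode: {mode!r}")
```

The timed path never invokes the algorithm at request time, which is where the response-time saving comes from.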

2. Associate the query algorithm model with query requirements and data sources to construct the business model;

The business model is the only link that responds to query requirements. Query requirements are passed directly to the business model, which is a logical model of the marine business subject library built according to the query business requirements; it embodies the business logic for decomposing and responding to a query requirement.

When the business model is constructed, the marine comprehensive library and the algorithm model are associated directly according to its business logic, forming a mapping between the business model and both of them.

For example, for a timed monthly-report query on the association influence among marine business subjects, the query command is first passed to the corresponding business model. The business model retrieves the monthly-report result set that is computed on a fixed monthly schedule from the marine data comprehensive library with the Apriori frequent item set-association rule algorithm. The result set is split as required, distributed back to the business model, cached, and then returned to the front-end query interface.
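
The mapping between a query requirement, its business model, the library tables, and the bound algorithm can be sketched as a simple registry. All names here (`BusinessModel`, `REGISTRY`, `register`, `answer`, and the table names) are hypothetical, and a dictionary of lists stands in for the marine comprehensive library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BusinessModel:
    name: str
    source_tables: tuple[str, ...]      # tables in the comprehensive library
    algorithm: Callable[[list], list]   # bound query algorithm


# Query type -> business model; the only entry point for query requirements.
REGISTRY: dict[str, BusinessModel] = {}


def register(query_type: str, model: BusinessModel) -> None:
    REGISTRY[query_type] = model


def answer(query_type: str, library: dict[str, list]) -> list:
    # Resolve the business model, gather its mapped tables, run its algorithm.
    model = REGISTRY[query_type]
    rows = [r for t in model.source_tables for r in library[t]]
    return model.algorithm(rows)
```

Because the table and algorithm associations are fixed when the model is registered, a request only needs to name its query type.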

3. Run the corresponding query algorithm model through UDF (user-defined function) and UDAF (user-defined aggregate function) programming interfaces.

By querying the database data associated with the algorithm model, computations such as clustering and classification are run on a schedule or in real time according to the update rules, and the resulting result sets are stored.
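
The idea of running analysis code inside the database through UDF and UDAF interfaces can be illustrated with Python's built-in sqlite3 module. SQLite is only a single-node stand-in chosen because it ships with Python; the patent does not name its database engine, and the functions `anomaly` and `value_range`, the table `sst`, and the sample rows are invented for illustration.

```python
import sqlite3


def anomaly(value, mean):
    """Scalar UDF: deviation of one observation from a reference mean."""
    return value - mean


class ValueRange:
    """Aggregate (UDAF): maximum minus minimum over a group."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def step(self, value):
        # Called once per row in the group; NULLs are skipped.
        if value is None:
            return
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def finalize(self):
        return None if self.lo is None else self.hi - self.lo


def build_connection():
    conn = sqlite3.connect(":memory:")
    conn.create_function("anomaly", 2, anomaly)
    conn.create_aggregate("value_range", 1, ValueRange)
    conn.execute("CREATE TABLE sst (station TEXT, temp REAL)")
    conn.executemany(
        "INSERT INTO sst VALUES (?, ?)",
        [("A", 10.0), ("A", 14.0), ("B", 20.0), ("B", 21.5)],  # made-up rows
    )
    return conn
```

Once registered, both functions are called from plain SQL, for example `SELECT station, value_range(temp) FROM sst GROUP BY station`, so the data never leaves the database to be processed.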

In step 2, a multi-link asynchronous message scheduling mode is adopted, centered on multi-channel CPU resource scheduling and asynchronous message communication. This saves network overhead and raises the upper limit of marine environment data query responsiveness.

Multi-data concurrency is invoked on the basis of the database distribution strategy, with the aim of speeding up the response of marine query services.

In step 3, the query result set is split so that computation is distributed over small data tables, which avoids large data sets and saves network overhead. Specifically, the following rules may be set:

1) Form models from the business scenario requirements and query algorithms according to configuration rules, which include classification attributes (classifying services or algorithms), priority attributes, model algorithm labels, and the like.

2) Sort the query requirements of different business scenarios by priority according to the configuration rules.

3) Split and distribute tables according to the result-set configuration rules, forming small result-set tables.

For example, numerical results are split by interval according to interval rules, and results that differ by region are split according to region rules.

4) Match distribution strategies according to the classification attributes of the services.

A. Avoiding the random distribution strategy

The invention avoids the default random distribution mode. Model result tables are distributed by classification using a customized distribution strategy, which gives the service fast response.

B. Avoiding the view (virtual table) strategy

The method avoids the common view (virtual table) approach and the cost of expanding the view's SQL at query time; a materialized-view construction strategy is adopted instead to improve query responsiveness.
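
Rules 1) to 4) and strategy A can be sketched as a configuration record, a priority sort, and a deterministic placement function. Everything here is an assumption for illustration: the field names, the convention that a lower number means higher priority, and the use of a CRC32 hash of the classification attribute to pick a node are not specified by the patent.

```python
from dataclasses import dataclass
from zlib import crc32


@dataclass(frozen=True)
class ConfigRule:
    scene: str            # business scenario name
    classification: str   # classification attribute (service or algorithm class)
    priority: int         # lower number = served first (assumed convention)
    algorithm_tag: str    # model algorithm label


def order_by_priority(rules):
    # sorted() is stable, so rules with equal priority keep their input order.
    return sorted(rules, key=lambda r: r.priority)


def node_for(rule, n_nodes):
    # Custom (non-random) distribution: the same classification always maps
    # to the same node. crc32 is stable across runs, unlike built-in hash().
    return crc32(rule.classification.encode("utf-8")) % n_nodes
```

Placing all result tables of one classification on the same node means a query for that classification touches a known node instead of searching randomly distributed fragments.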

Examples

1. Divide the query services according to marine business requirements into full query and customized query.

Where:

Full query requirement: a full query returns the data volume, time, and distribution of the elements. The elements to be queried can be filtered by various conditions, such as source, time range, spatial range, voyage, equipment code, survey unit, and domestic or international origin.

Customized query requirement: a customized query returns the data volume, time, and distribution of an element data table. The user can freely choose conditions on each field of the table.

2. Plan the workflow according to the ocean data query business requirements, as shown in FIG. 3.

3. Associate data tables according to the planned query service requirements to form a data model.

4. Construct the marine query algorithm model.

This example constructs a timed-query monthly data report model covering Argo, GTSPP, WOD, ICOADS, GTS, NDBC, DBCP, NEAR-GOOS, GLOSS, IOC water level, U.S. ocean stations, NGDC, IODP, and others.

Because data reception is reported monthly for each data source, a 'data source analysis monthly report model' must be constructed; this is the business model.

When a business query arrives, the query requirement is automatically associated with the 'data source analysis monthly report model'. The business logic of this model is established at construction time according to the data content to be analyzed; data tables and algorithm models are associated according to that logic; and together they produce the computed results, which are stored as each month's analysis report data.

Using distributed database technology, the timed monthly report model is constructed, and its designed input and output views are built as materialized views, forming physical tables.

Data is output according to the query requirement.

5. According to the monthly-report output requirements, algorithms such as K-Means clustering are integrated, and cluster mining is performed on historical and related data to form the query result set.

6. Match the result-set data to the corresponding distribution strategy according to the configuration rules, and distribute the data automatically into small tables.

7. Output the query result.

As shown in FIG. 1, in this embodiment the query analysis system sends a data analysis request to the database, the database performs distributed parallel computation to obtain the query analysis result, and the result is returned to the terminal as small data sets.
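
The difference between an ordinary view and the materialized, physical-table approach used for the monthly report can be sketched with sqlite3. SQLite has no native materialized views, so a table created from the view and refreshed on a schedule stands in for one; the table names, column names, and sample rows are invented for illustration and are not data from the patent.

```python
import sqlite3


def build(conn):
    conn.execute("CREATE TABLE received (source TEXT, month TEXT, n_records INTEGER)")
    conn.executemany(
        "INSERT INTO received VALUES (?, ?, ?)",
        [("Argo", "2021-04", 120), ("Argo", "2021-04", 80), ("GTSPP", "2021-04", 50)],
    )
    # Ordinary view: the aggregation is re-run every time it is queried.
    conn.execute(
        """CREATE VIEW monthly_view AS
           SELECT source, month, SUM(n_records) AS total
           FROM received GROUP BY source, month"""
    )
    # Materialized stand-in: run the aggregation once, store it as a physical table.
    conn.execute("CREATE TABLE monthly_report AS SELECT * FROM monthly_view")


def refresh(conn):
    # Timed refresh (for example once a month): recompute and overwrite.
    conn.execute("DELETE FROM monthly_report")
    conn.execute("INSERT INTO monthly_report SELECT * FROM monthly_view")
```

Queries against `monthly_report` read stored rows and skip the aggregation, at the cost of results being only as fresh as the last refresh, which matches the timed-query use case.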

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

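1. A method for realizing high-speed query of ocean data using distributed database technology, characterized by comprising the following steps:

S1, constructing a marine-specific query algorithm model prototyped on MLlib according to In-database Analysis technology, and associating the query algorithm model with query service requirements to obtain a service model;

S2, on receiving a query request, associating the request with the corresponding service model, which communicates with the query algorithm model through asynchronous messages to complete the query and obtain a query result;

S3, splitting the query result set and distributing the output as small-unit data results.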

2. The method of claim 1, wherein: in step S1, the business model directly associates the marine comprehensive library with the query algorithm model according to the business logic when it is constructed, so as to form a mapping relationship.

3. The method of claim 1, wherein: in step S1, the ocean data query requirement may be a real-time query or a timed query, and the results of timed queries are stored in the database;

when a service query request is received, if it is a timed query, the stored result is pushed directly from the database; if it is a real-time query, the query is executed to obtain the latest result, the temporary table is split directly for result feedback, and the result is then stored in the database for later use.

4. The method of claim 1, wherein: in step S2, after a query request message is issued, the query algorithm model responds to the received message; meanwhile, message communication does not wait for that response but issues other messages concurrently.

5. The method of claim 1, wherein: in step S3, numerical results are split by interval according to interval rules, and results that differ by region are split according to region rules.