CN114138857A

CN114138857A - Big data mining method and device based on watershed water environment

Info

Publication number: CN114138857A
Application number: CN202111329268.6A
Authority: CN
Inventors: 王国强; 薛宝林; 王溥泽; 彭岩波
Original assignee: Beijing Normal University
Current assignee: Beijing Normal University
Priority date: 2021-11-10
Filing date: 2021-11-10
Publication date: 2022-03-04

Abstract

The invention relates to the technical field of environmental data processing, in particular to a method and a device for mining big data based on a watershed water environment. The method comprises the following steps: acquiring original data from each application through an interface access layer; preprocessing original data through a data acquisition ETL platform to obtain input data, and inputting the input data into a pre-trained data mining model; calculating a data model index based on a first data mining module; based on a second data mining module, carrying out keyword extraction and abstract extraction on the text data, and carrying out structural processing on the information of the text data; acquiring and storing mining data obtained through a data mining model; and when receiving a query request corresponding to the mining data, feeding back the mining data through the data encapsulation exchange interface. By adopting the method and the system, the knowledge discovery of the economic society, the meteorological hydrology, the water environment and other cross-fields can be carried out through the big data mining technology, and the technical support is provided for realizing the intelligent basin management.

Description

Big data mining method and device based on watershed water environment

Technical Field

The invention relates to the technical field of environmental data processing, in particular to a method and a device for mining big data based on a watershed water environment.

Background

Data mining is the process of extracting hidden, previously unknown but potentially useful information and knowledge from massive, incomplete, noisy, fuzzy and random real-world data. It is an important means of mining knowledge from databases and obtaining key data for decision support. Data mining algorithms have been studied in considerable depth at home and abroad, including association rules, data classification and clustering rules. In data classification, various methods such as decision trees and neural networks have been developed.

At present, the volume of water-related environmental management business data (such as water environment monitoring data, environmental statistics and wastewater discharge monitoring data), together with the related social, economic, hydrological, water resource and meteorological data, continues to grow. However, because water-related management departments are numerous and lack overall coordination, traditional informatization has been carried out separately and independently by each department, forming numerous data silos. Data resources are not processed in sufficient depth, and the statistical, logical and even mechanistic associations among the various types of data have not been discovered. On the basis of big data collection and integration, there is an urgent need to carry out cross-domain knowledge discovery across the socio-economic, meteorological-hydrological and water environment fields through big data mining technology, so as to provide technical support for intelligent watershed management.

Disclosure of Invention

The embodiment of the invention provides a method and a device for mining big data based on a watershed water environment. The technical scheme is as follows:

In one aspect, a big data mining method based on a watershed water environment is provided. The method is implemented by a big data mining platform and comprises the following steps:

acquiring original data from each application through an interface access layer;

preprocessing the original data through a data acquisition ETL platform to obtain input data meeting the model standard, and inputting the input data into a pre-trained data mining model; the data mining model is divided into a first data mining module facing service evaluation and a second data mining module facing text analysis;

calculating a data model index based on the first data mining module; the data model indexes are divided into 4 types, namely a section water quality evaluation type, a water quality index calculation type, a water environment bearing capacity evaluation type and a water ecological safety evaluation type;

based on the second data mining module, carrying out keyword extraction and abstract extraction on the text data, and carrying out structural processing on the information of the text data;

acquiring and storing mining data obtained through the data mining model;

and when receiving a query request corresponding to the mining data, feeding back the mining data through the data encapsulation exchange interface.

Optionally, the preprocessing the raw data by the data acquisition ETL platform includes:

and performing data cleaning, data format conversion, data completion and data quality management on the original data through a data acquisition ETL platform.

Optionally, the section water quality evaluation type includes river water quality evaluation, lake and reservoir eutrophication evaluation, surface water drinking water quality evaluation, groundwater drinking water quality evaluation, nearshore sea area water quality evaluation, and regional water quality evaluation.

Optionally, the water quality index calculation type includes water quality index calculation, the water quality comprehensive pollution index, urban water quality index calculation, and calculation of the regional comprehensive exceedance index of the Yangtze River Economic Belt.

Optionally, the water environment bearing capacity evaluation type includes Yangtze River Economic Belt water environment bearing capacity evaluation, ecological environment pressure evaluation, ecosystem health evaluation, ecological service function evaluation, and ecological risk evaluation.

Optionally, the water ecological safety assessment type comprises a water ecological safety assessment.

Optionally, based on the second data mining module, performing keyword extraction on the text data, including:

based on the TextRank algorithm, the text is divided into a plurality of composition units, a graph model is established, important components in the text are sequenced by using a voting mechanism, and keyword extraction is carried out on the text data.
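The TextRank step described above can be illustrated with a minimal sketch. It assumes the text has already been segmented into tokens (the composition units), links tokens that co-occur within a small window, and lets each node vote for its neighbours with a damping factor of 0.85. The function name `textrank_keywords` and its parameters are hypothetical and are not taken from the embodiment.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=3, damping=0.85, iterations=50, top_k=3):
    """Rank tokens with a TextRank-style vote over a co-occurrence graph."""
    # Build an undirected graph: tokens within `window` positions are linked.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Iteratively let every node share its score among its neighbours.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

In practice the tokens would come from a word segmenter and would usually be filtered by part of speech before the graph is built; both steps are outside this sketch.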

Optionally, based on the second data mining module, performing summary extraction on the text data, including:

searching in the data according to a Query statement of the text data to obtain a plurality of search results;

performing morpheme analysis on the text data to generate a plurality of morphemes;

for each search result, calculating a relevance score of each morpheme and each search result;

and carrying out weighted summation on the relevance scores of the morphemes relative to the search results to obtain the relevance scores of the Query sentences and the search results, and carrying out abstract extraction on the text data according to the relevance scores of the Query sentences and the search results.
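The weighted summation of morpheme relevance scores can be sketched as follows. The embodiment does not name a specific scoring function; this sketch assumes a BM25-style score, in which the weight of each morpheme is a smoothed inverse document frequency and its relevance to a search result saturates with term frequency. The function name `bm25_scores` and the parameters `k1` and `b` are assumptions made for illustration.

```python
import math


def bm25_scores(query_morphemes, documents, k1=1.5, b=0.75):
    """Score each tokenized search result against the query morphemes."""
    n_docs = len(documents)
    avg_len = sum(len(d) for d in documents) / n_docs
    scores = []
    for doc in documents:
        total = 0.0
        for q in query_morphemes:
            # Weight of the morpheme: smoothed inverse document frequency.
            df = sum(1 for d in documents if q in d)
            weight = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Relevance of the morpheme to this result, with length normalization.
            tf = doc.count(q)
            norm = k1 * (1 - b + b * len(doc) / avg_len)
            relevance = tf * (k1 + 1) / (tf + norm)
            # Weighted summation over all morphemes of the query.
            total += weight * relevance
        scores.append(total)
    return scores
```

For abstract extraction, the candidate results (for example, sentences of the text) with the highest scores would be kept as the summary; how many are kept is a design choice not fixed by the text above.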

Optionally, the structuring the information of the text data includes:

carrying out structuring processing on the information of the text data, searching geographic position information in the mining data by adopting a word segmentation technology based on combination of rules and statistics based on a water environment word segmentation dictionary, and positioning through an electronic map;

and performing classified display on the mining data according to the screening conditions.

In another aspect, a big data mining device based on a watershed water environment is provided. The device is applied to the above big data mining method based on a watershed water environment and comprises:

the acquisition module is used for acquiring original data from each application through the interface access layer;

the preprocessing module is used for preprocessing the original data through a data acquisition ETL platform to obtain input data meeting the model standard and inputting the input data into a pre-trained data mining model; the data mining model is divided into a first data mining module for service evaluation and a second data mining module for text analysis;

the calculation module is used for calculating a data model index based on the first data mining module; the data model indexes are divided into 4 types, namely a section water quality evaluation type, a water quality index calculation type, a water environment bearing capacity evaluation type and a water ecological safety evaluation type;

the extraction module is used for extracting keywords and abstracts from the text data based on the second data mining module and carrying out structural processing on the information of the text data;

the storage module is used for acquiring and storing the mining data obtained by the data mining model;

and the query module is used for feeding back the mining data through the data encapsulation exchange interface when receiving a query request corresponding to the mining data.

In another aspect, a big data mining platform is provided, and the big data mining platform comprises a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the big data mining method based on the watershed water environment.

In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the method for big data mining based on the watershed water environment.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

acquiring original data from each application through an interface access layer; preprocessing the original data through a data acquisition ETL platform to obtain input data meeting the model standard, and inputting the input data into a pre-trained data mining model; the data mining model is divided into a first data mining module facing service evaluation and a second data mining module facing text analysis; calculating a data model index based on the first data mining module; the data model indexes are divided into 4 types, namely a section water quality evaluation type, a water quality index calculation type, a water environment bearing capacity evaluation type and a water ecological safety evaluation type; based on the second data mining module, carrying out keyword extraction and abstract extraction on the text data, and carrying out structural processing on the information of the text data; acquiring and storing mining data obtained through the data mining model; and when receiving a query request corresponding to the mining data, feeding back the mining data through the data encapsulation exchange interface. Therefore, the method can surround the water environment management target, take hydrology, water resources, water environment, meteorology, social economy and other big data as analysis objects, generalize and analyze the mining requirements of the watershed water environment data from the aspect of evaluation decision and service management, determine the data mining theme and target by combining the time characteristics and the space characteristics of the water environment management service, construct a data mining service model taking the application scenes of current state analysis, cause analysis, traceability analysis, potential evaluation, anomaly identification, trend early warning and the like as analysis objects, and realize data mining.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

FIG. 1 is an implementation environment diagram of a big data mining method based on a watershed water environment according to an embodiment of the present invention;

FIG. 2 is a flow chart of a big data mining method based on a watershed water environment according to an embodiment of the present invention;

FIG. 3 is a block diagram of a big data mining device based on a watershed water environment according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a big data mining platform according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

An embodiment of the invention provides an implementation environment for the big data mining method based on a watershed water environment. As shown in FIG. 1, the implementation environment comprises at least a big data mining platform, which may comprise an interface access layer, a data base layer, a data service layer, a data calculation layer, a data acquisition layer and a source data storage layer;

the interface access layer is used for acquiring original data from a plurality of applications;

the data base layer is used for carrying out authority verification, safety verification, resource management and service management;

the data service layer is used for calculating data model indexes, performing semantic analysis and managing metadata;

the data calculation layer performs data calculation based on Hadoop, Spark and HBase;

the data acquisition layer performs data acquisition, cleaning and conversion based on Sqoop and Kafka;

the source data storage layer stores data based on SQL Server, MySQL and Oracle.

The metadata connects the data sources, the data warehouse and the data applications, and records the complete link of the data from generation to consumption. The metadata contains static table, column and partition information (i.e., the MetaStore); dynamic task and table dependency mappings; the model definitions and data life cycle of the data warehouse; and ETL task scheduling information, inputs and outputs. Metadata is the basis for data management, data content and data applications. The whole big data system is built on metadata; without a complete metadata design, the data becomes difficult to track, permissions become difficult to control, resources become difficult to manage, and data becomes difficult to share.

ETL, Extract-Transform-Load, is used to describe the process of extracting (Extract), converting (Transform), and loading (Load) data from a source to a destination. The term ETL is more commonly used in data warehouses, but its objects are not limited to data warehouses. The ETL platform plays an important role in data cleaning, data format conversion, data completion, data quality management and the like. As an important data cleansing middle layer, ETL should support a variety of data sources, such as a messaging system, a file system, etc.
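The four roles of the ETL platform named above (data cleaning, format conversion, data completion and data quality management) can be illustrated with a minimal sketch. The record layout, the default unit and the validity rule are assumptions chosen for illustration only; the embodiment does not specify them, and the function name `preprocess` is hypothetical.

```python
def preprocess(records, required=("station", "indicator", "value")):
    """Clean, convert and complete raw monitoring records (illustrative only)."""
    cleaned = []
    for rec in records:
        # Data cleaning: drop records that lack a required field.
        if any(rec.get(key) in (None, "") for key in required):
            continue
        # Format conversion: values often arrive as strings.
        try:
            value = float(rec["value"])
        except (TypeError, ValueError):
            continue
        # Data completion: fill an assumed default unit when it is missing.
        unit = rec.get("unit") or "mg/L"
        # Quality management: flag negative concentrations as invalid.
        cleaned.append({
            "station": rec["station"].strip(),
            "indicator": rec["indicator"].strip(),
            "value": value,
            "unit": unit,
            "valid": value >= 0,
        })
    return cleaned
```

Records that fail cleaning are dropped here for brevity; a production pipeline would more likely route them to a rejection log so that quality problems can be traced to their source.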

A data model is established according to the business scenarios and business rules; the collected multi-source data sets are converted and cleaned into input data meeting the data model standard; the model is computed, trained and optimized; and the index calculation results are output. For the complex business data of the water environment, and according to the data mining targets, 20 data model indexes in 5 types are established for the data mining tool: section water quality evaluation, water quality index calculation, water environment bearing capacity evaluation, water ecological safety evaluation, and semantic analysis.

Most data queries are requirement-driven: for each requirement, one or more interfaces are developed, interface documents are written, and the interfaces are opened for the business side to call.

The embodiment of the invention provides a basin water environment-based big data mining method, which can be realized by a big data mining platform. As shown in fig. 2, a flow chart of a big data mining method based on watershed water environment, a processing flow of the method may include the following steps:

step 201, obtaining original data from each application through an interface access layer.

Step 202, preprocessing the original data through a data acquisition ETL platform to obtain input data meeting the model standard, and inputting the input data into a pre-trained data mining model.

The data mining model is divided into a first data mining module facing business evaluation and a second data mining module facing text analysis.

Optionally, preprocessing the original data through the data acquisition ETL platform includes: performing data cleaning, data format conversion, data completion and data quality management on the original data through the data acquisition ETL platform.

Step 203, calculating the data model indexes based on the first data mining module. The data model indexes are divided into 4 types, namely a section water quality evaluation type, a water quality index calculation type, a water environment bearing capacity evaluation type and a water ecological safety evaluation type.

Optionally, the section water quality evaluation type includes river water quality evaluation, lake and reservoir eutrophication evaluation, surface water drinking water quality evaluation, groundwater drinking water quality evaluation, nearshore sea area water quality evaluation, and regional water quality evaluation.

Optionally, the water environment bearing capacity evaluation type includes Yangtze River Economic Belt water environment bearing capacity evaluation, ecological environment pressure evaluation, ecosystem health evaluation, ecological service function evaluation and ecological risk evaluation.

Each index is explained specifically below:

1. River water quality evaluation

(1) Calculating the evaluation index concentration exceedance index (R)

The evaluation uses the 22 indexes in Table 1 of the "Surface Water Environmental Quality Standard" (GB3838-2002) other than water temperature and fecal coliform: pH, dissolved oxygen, permanganate index, biochemical oxygen demand, ammonia nitrogen, petroleum, volatile phenols, mercury, lead, total nitrogen (not evaluated for river sections), total phosphorus, chemical oxygen demand, copper, zinc, fluoride, selenium, arsenic, cadmium, chromium (hexavalent), cyanide, anionic surfactants, and sulfides.

The evaluation index concentration exceedance index (R) is obtained by comparing the monitored concentration of an evaluation index with the standard concentration limit that corresponds to the target of the section; the calculation method is given in formulas (2-1) to (2-5). Following the weakest-link (short-board) principle, the largest exceedance index among the evaluation indexes of a section is taken as the exceedance index (R) of the section, and the corresponding index is the primary pollution index of the section.

① Exceedance index of a single water quality index ($R$):

In the formula, $R$ represents the exceedance index; $C$ represents the measured concentration, mg/L; $S$ represents the evaluation standard limit, mg/L.

② Exceedance index of dissolved oxygen ($R_{DO}$), calculated separately for the cases $C_{DO} \ge S_{DO}$ and $C_{DO} < S_{DO}$:

In the formula, $R_{DO}$ represents the pollution index of dissolved oxygen; $C_{DO}$ represents the measured dissolved oxygen concentration, mg/L; $S_{DO}$ represents the evaluation standard limit of dissolved oxygen, mg/L; $C_{DO,f}$ represents the saturated dissolved oxygen concentration, mg/L.

③ Exceedance index of pH ($R_{pH}$), calculated separately for the cases pH ≤ 7 and pH > 7:

In the formula, $R_{pH}$ represents the pollution index of pH; $C_{pH}$ represents the measured pH; $S_{pHd}$ represents the lower limit of pH in the evaluation standard; $S_{pHu}$ represents the upper limit of pH in the evaluation standard.
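Formulas (2-1) to (2-5) themselves are not reproduced in this text. As a hedged reconstruction only, the following forms are consistent with the symbol definitions above and with the thresholds used later (R = 0 exactly at the standard limit, R < 0 when the standard is met). They assume the conventional single-factor standard indices for ordinary indexes, dissolved oxygen and pH, each shifted by one; the equation numbering and the exact forms are assumptions, not taken from the source.

```latex
% Assumed forms; the original formula images are not available in this text.
\begin{align}
R &= \frac{C}{S} - 1 \\
R_{DO} &= \frac{\lvert C_{DO,f} - C_{DO} \rvert}{C_{DO,f} - S_{DO}} - 1,
  & C_{DO} &\ge S_{DO} \\
R_{DO} &= \left(10 - 9\,\frac{C_{DO}}{S_{DO}}\right) - 1,
  & C_{DO} &< S_{DO} \\
R_{pH} &= \frac{7.0 - C_{pH}}{7.0 - S_{pHd}} - 1,
  & C_{pH} &\le 7 \\
R_{pH} &= \frac{C_{pH} - 7.0}{S_{pHu} - 7.0} - 1,
  & C_{pH} &> 7
\end{align}
```

Under these assumed forms, a permanganate index of 7.2 mg/L against a limit of 6 mg/L would give R = 7.2/6 - 1 = 0.2.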

(2) Superscalar type determination

According to the intervals R > 0.2, 0.2 ≥ R > 0, 0 ≥ R > −0.2 and R ≤ −0.2, the water environment quality evaluation result of a section (point location) is divided into four types by pollution index concentration: seriously exceeding the standard, exceeding the standard, close to exceeding the standard, and not exceeding the standard.

(3) Early warning level determination

To improve the accuracy of surface water environment quality early warning and of the formulation of management measures, a process evaluation reflecting water quality change is introduced on top of the section exceedance type, and early warning levels are divided according to whether the concentration of the primary pollution index is rising or falling.

Seriously exceeding sections, and exceeding sections whose primary pollution index exceedance index is rising, are defined as red early warning. Exceeding sections whose primary pollution index exceedance index is not rising are defined as orange early warning. Close-to-exceeding sections whose primary pollution index exceedance index is rising are defined as yellow early warning, and those whose index is not rising are defined as blue early warning. Sections that do not exceed the standard receive no early warning.
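The exceedance type and early warning rules can be sketched as two small functions. The sketch assumes the four intervals R > 0.2, 0.2 ≥ R > 0, 0 ≥ R > −0.2 and R ≤ −0.2, and takes the trend of the primary pollution index as a boolean input, since the text does not say how "rising" is measured. The function names are hypothetical.

```python
def exceedance_type(r):
    """Map a section exceedance index R to its exceedance type."""
    if r > 0.2:
        return "seriously exceeding"
    if r > 0:
        return "exceeding"
    if r > -0.2:
        return "close to exceeding"
    return "not exceeding"


def warning_level(r, rising):
    """Combine the exceedance type with the trend of the primary pollution index."""
    kind = exceedance_type(r)
    if kind == "seriously exceeding" or (kind == "exceeding" and rising):
        return "red"
    if kind == "exceeding":
        return "orange"
    if kind == "close to exceeding":
        return "yellow" if rising else "blue"
    return "none"
```

For example, a section with R = 0.1 whose primary pollution index is still rising is escalated to a red warning, while the same R with a stable or falling index yields orange.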

2. Evaluation of lake water quality

The lake and reservoir water quality evaluation algorithm is based on the 'surface water environment quality standard' (GB3838-2002), 22 items except water temperature and fecal coliform group bacteria in the table 1 are selected as evaluation indexes, a single-factor evaluation method is adopted, namely, the 22 indexes are evaluated one by one according to standard limit values, one item with the highest category in the evaluated indexes is selected as a water quality grade of a section, evaluation of the surface lake and reservoir monitoring section water quality grade is realized, and the evaluation result comprises the water quality grade of each single index and the water quality grade of the section.
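The single-factor evaluation method can be sketched as follows. For brevity the sketch covers only two of the 22 indexes, using the class I to V limits of the permanganate index and ammonia nitrogen from Table 1 of GB3838-2002; the dictionary layout and function names are hypothetical. Indexes for which a lower value is worse, such as dissolved oxygen, would need the comparison reversed and are not handled here.

```python
# Class I-V upper limits in mg/L for two illustrative indexes.
CLASS_LIMITS = {
    "permanganate_index": [2, 4, 6, 10, 15],
    "ammonia_nitrogen": [0.15, 0.5, 1.0, 1.5, 2.0],
}
CLASS_NAMES = ["I", "II", "III", "IV", "V", "worse than V"]


def index_class(name, value):
    """Return the position in CLASS_NAMES of the first class whose limit is met."""
    for rank, limit in enumerate(CLASS_LIMITS[name]):
        if value <= limit:
            return rank
    return len(CLASS_LIMITS[name])


def section_class(measurements):
    """Grade every index, then take the worst class as the section class."""
    per_index = {name: index_class(name, v) for name, v in measurements.items()}
    worst = max(per_index.values())
    graded = {name: CLASS_NAMES[rank] for name, rank in per_index.items()}
    return graded, CLASS_NAMES[worst]
```

With a permanganate index of 5.0 mg/L (class III) and ammonia nitrogen of 1.2 mg/L (class IV), the section is graded class IV, because the worst single index decides the result.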

3. Evaluation of lake and reservoir eutrophication

(1) The nutrient state index calculation formula of each item is as follows:

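$$\mathrm{TLI}(\mathrm{chla}) = 10\,(2.5 + 1.086\,\ln \mathrm{chla}) \tag{2-6}$$

$$\mathrm{TLI}(\mathrm{TP}) = 10\,(9.436 + 1.624\,\ln \mathrm{TP}) \tag{2-7}$$

$$\mathrm{TLI}(\mathrm{TN}) = 10\,(5.453 + 1.694\,\ln \mathrm{TN}) \tag{2-8}$$

$$\mathrm{TLI}(\mathrm{SD}) = 10\,(5.118 - 1.94\,\ln \mathrm{SD}) \tag{2-9}$$

$$\mathrm{TLI}(\mathrm{COD_{Mn}}) = 10\,(0.109 + 2.661\,\ln \mathrm{COD_{Mn}}) \tag{2-10}$$

In the formulas, chla is in mg/m³ and SD is in m; all other indexes are in mg/L.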

(2) The comprehensive nutrient state index is calculated as follows:

$$\mathrm{TLI}(\Sigma) = \sum_{j=1}^{m} W_j \cdot \mathrm{TLI}(j)$$

In the formula, $\mathrm{TLI}(\Sigma)$ represents the comprehensive nutrient state index; $W_j$ represents the relative weight of the nutrient state index of the j-th parameter; $\mathrm{TLI}(j)$ represents the nutrient state index of the j-th parameter.

With chla as the reference parameter, the normalized correlation weight of the j-th parameter is calculated as:

$$W_j = \frac{r_{ij}^2}{\sum_{j=1}^{m} r_{ij}^2}$$

In the formula, $r_{ij}$ represents the correlation coefficient between the j-th parameter and the reference parameter chla; $m$ represents the number of evaluation parameters.
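The eutrophication calculation can be sketched in a few lines. The per-parameter functions follow formulas (2-6) to (2-10). The weighting assumes that each weight is the squared correlation coefficient normalized over all parameters, which is the usual form of the comprehensive trophic level index; the correlation coefficients are supplied by the caller because the text does not list them. Function names are hypothetical.

```python
import math


def tli_components(chla, tp, tn, sd, cod_mn):
    """Nutrient state index of each parameter (chla in mg/m3, SD in m, rest mg/L)."""
    return {
        "chla": 10 * (2.5 + 1.086 * math.log(chla)),
        "TP": 10 * (9.436 + 1.624 * math.log(tp)),
        "TN": 10 * (5.453 + 1.694 * math.log(tn)),
        "SD": 10 * (5.118 - 1.94 * math.log(sd)),
        "CODMn": 10 * (0.109 + 2.661 * math.log(cod_mn)),
    }


def comprehensive_tli(components, correlations):
    """Weighted sum of the indices; weights are normalized squared correlations."""
    total = sum(correlations[name] ** 2 for name in components)
    return sum(
        correlations[name] ** 2 / total * value
        for name, value in components.items()
    )
```

Because the weights use squared correlations, a parameter that is negatively correlated with chla, such as transparency (SD), still contributes a positive weight.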

4. Surface water drinking water quality evaluation

The drinking water quality evaluation algorithm for surface water is based on the "Surface Water Environmental Quality Standard" (GB3838-2002). The evaluation indexes are 22 of the 24 basic items in Table 1 (excluding water temperature and fecal coliform; total nitrogen is not evaluated for rivers), the 5 supplementary items for centralized domestic drinking water surface water sources in Table 2, and the 80 specific items for centralized domestic drinking water surface water sources in Table 3. A single-factor evaluation method is adopted: the 107 indexes are graded one by one against the standard limits, and the highest (worst) class among them is taken as the water quality class of the section. This realizes water quality class evaluation for the monitoring sections of river-type and lake/reservoir-type surface drinking water sources; the evaluation result includes the water quality class of each single index and of the section.

5. Groundwater drinking water quality evaluation

The drinking water quality evaluation algorithm for groundwater is based on the "Groundwater Quality Standard" (GB/T 14848-2017). The 39 conventional indexes in Table 1 and the 54 unconventional indexes in Table 2 are selected as evaluation indexes. A single-factor evaluation method is adopted: the 93 indexes are graded one by one against the standard limits, and the highest (worst) class among them is taken as the water quality class of the section. This realizes water quality class evaluation for the monitoring sections of groundwater-type drinking water sources; the evaluation result includes the water quality class of each single index and of the section.

6. Water quality evaluation in offshore area

The offshore area water quality evaluation algorithm is based on seawater quality standards (GB3097-1997), 38 indexes except water temperature in the table 1 are selected as evaluation indexes, a single-factor evaluation method is adopted, namely the 38 indexes are evaluated one by one according to standard limit values, one with the highest category in the indexes is selected as the water quality grade of a section, the evaluation of the offshore area monitoring section water quality grade is realized, and the evaluation result comprises the water quality grade of each single index and the water quality grade of the section.

7. Regional water quality assessment

(1) Regional water quality assessment

When the total number of the sections in the evaluation area is less than 5, calculating the arithmetic mean value of the evaluation index concentrations of all the sections, then evaluating the water quality of the sections, and determining the water quality condition of the area according to the water quality types of the sections in the following table.

TABLE-1 Water quality classes

When the total number of the sections in the evaluation area is more than 5 (including 5), a section water quality type proportion method is adopted, namely, the water quality condition is evaluated according to the percentage of the number of the sections of each water quality type in the evaluation area to the total number of all the evaluation sections. The evaluation criteria are shown in the following table:

TABLE-2 Water quality evaluation criteria

(2) Primary pollution index determination

The method for determining the main pollution indexes of the section comprises the following steps:

When the water quality of the section is excellent or good, the main pollution indexes are not evaluated.

When the water quality of the section exceeds the class III standard, the three indexes with the worst water quality classes are selected as the main pollution indexes. When different indexes fall in the same water quality class, their exceedance multiples are calculated and ranked, and the three indexes with the largest exceedance multiples are taken as the main pollution indexes. When cyanide or heavy metals such as lead and chromium exceed the standard, they are preferentially taken as main pollution indexes.

When the main pollution indexes are determined, each is annotated with the multiple by which its concentration exceeds the class III water quality standard, i.e. the exceedance multiple, for example for the permanganate index. Exceedance multiples are not calculated for water temperature, pH and dissolved oxygen.

The method for determining the main pollution indexes of an area comprises the following steps:

Indexes whose water quality exceeds the class III standard are ranked by section exceedance rate, and the three with the highest section exceedance rates are generally taken as the main pollution indexes. For rivers and basins (water systems) with fewer than 5 sections, the main pollution indexes of each section are determined according to the section method described above.

8. Water quality index calculation

(1) Water quality index of single index

The concentration value of each single index is divided by the surface water class III standard limit corresponding to that index to obtain the water quality index of the single index, as shown in the following formula:

$$\mathrm{CWQI}(i) = \frac{C(i)}{C_s(i)}$$

In the formula: $C(i)$ is the concentration value of the i-th water quality index; $C_s(i)$ is the surface water class III standard limit of the i-th water quality index; $\mathrm{CWQI}(i)$ is the water quality index of the i-th water quality index.

Further:

① Calculation method for dissolved oxygen

$$\mathrm{CWQI}(DO) = \frac{C_s(DO)}{C(DO)}$$

In the formula: $C(DO)$ is the concentration value of dissolved oxygen; $C_s(DO)$ is the surface water class III standard limit of dissolved oxygen; $\mathrm{CWQI}(DO)$ is the water quality index of dissolved oxygen.

② Calculation method for pH

If pH ≤ 7, the calculation formula is:

$$\mathrm{CWQI}(pH) = \frac{7.0 - pH}{7.0 - pH_{sd}}$$

If pH > 7, the calculation formula is:

$$\mathrm{CWQI}(pH) = \frac{pH - 7.0}{pH_{su} - 7.0}$$

In the formula: $pH_{sd}$ is the lower limit of pH in the "Surface Water Environmental Quality Standard" (GB3838-2002); $pH_{su}$ is the upper limit of pH in the same standard; $\mathrm{CWQI}(pH)$ is the water quality index of pH.

(2) Section water quality index

The sum of the single-index CWQI values is taken as the CWQI of the section, calculated as follows:

$$\mathrm{CWQI}_{section} = \sum_{i=1}^{n} \mathrm{CWQI}(i)$$
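The single-index and section calculations can be sketched as follows. The ordinary index follows the division rule stated in the text. The dissolved oxygen form (limit divided by concentration) and the pH form (distance from 7.0 relative to the nearer limit) are assumed from the urban surface water quality ranking method that this section refers to, and the default pH limits of 6 and 9 are those of GB3838-2002. Function names are hypothetical.

```python
def cwqi_single(c, c_s):
    """Ordinary index: concentration divided by the class III limit."""
    return c / c_s


def cwqi_do(c_do, c_s_do):
    """Dissolved oxygen: inverted, because a lower concentration is worse."""
    return c_s_do / c_do


def cwqi_ph(ph, ph_sd=6.0, ph_su=9.0):
    """pH: distance from neutral relative to the lower or upper limit."""
    if ph <= 7:
        return (7.0 - ph) / (7.0 - ph_sd)
    return (ph - 7.0) / (ph_su - 7.0)


def cwqi_section(single_values):
    """Section index: sum of all single-index values."""
    return sum(single_values)
```

As a usage example, a permanganate index of 3.0 mg/L against a limit of 6 mg/L, dissolved oxygen of 10 mg/L against a limit of 5 mg/L, and a pH of 6.5 each contribute 0.5, so the section index over these three indexes is 1.5.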

9. index of water quality comprehensive pollution

A water quality comprehensive pollution index algorithm is based on a single index water quality index evaluation method in the technical specification (trial) of urban surface water environment quality ranking, 21 indexes except water temperature, faecal coliform and total nitrogen in the table 1 of the quality standard of surface water environment (GB3838-2002) are adopted, the water quality indexes of the single indexes are calculated and then summed to serve as surface water monitoring section water quality indexes, and evaluation results comprise the water quality indexes of the single indexes and the section water quality indexes.

10. Urban water quality index calculation

(1) Water quality index of river

First, the arithmetic mean of the concentration of each single index over all river monitoring sections is calculated and the CWQI of each single index is computed from it; the sum of the single-index CWQI values is then taken as the CWQI of the river, as shown in the following formula:

$$\mathrm{CWQI}_{river} = \sum_{i=1}^{n} \mathrm{CWQI}(i)$$

In the formula: $\mathrm{CWQI}_{river}$ is the water quality index of the river; $\mathrm{CWQI}(i)$ is the water quality index of the i-th water quality index; $n$ is the number of water quality indexes.

(2) Water quality index of lake or reservoir

The method for calculating the water quality index of the lake and the reservoir is consistent with that of a river, the arithmetic mean value of the concentrations of all the single indexes of the monitoring points of the lake and the reservoir is calculated, the water quality index of the single index is calculated, and then the water quality index of the lake and the reservoir is comprehensively calculated. In addition, when calculating the water quality index of a single index, the class III standard limit of the total phosphorus in lakes and reservoirs is 0.05mg/L, unlike that in rivers.

(3) Water quality index of city

The CWQI of the city is the weighted mean of the CWQI of the rivers and of the lakes and reservoirs in the urban district, as shown in the following formula:

In the formula: CWQI_city is the water quality index of the city; CWQI_river is the water quality index of the rivers; CWQI_lake is the water quality index of the lakes and reservoirs; m is the number of river sections in the city; n is the number of lake and reservoir points in the city.
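As a non-limiting illustration, the city-level calculation can be sketched as follows. The formula is not reproduced in this text, so the sketch assumes that "weighted mean" means weighting the river index by the number of river sections m and the lake and reservoir index by the number of lake and reservoir points n. The function names and sample values are hypothetical.

```python
def section_cwqi(single_indices):
    """Index of a section (or river) as the sum of its single-index values."""
    return sum(single_indices)


def city_cwqi(cwqi_river: float, cwqi_lake: float, m: int, n: int) -> float:
    """Assumed weighted mean: (CWQI_river * m + CWQI_lake * n) / (m + n)."""
    if m < 0 or n < 0 or m + n == 0:
        raise ValueError("m and n must be non-negative and not both zero")
    return (cwqi_river * m + cwqi_lake * n) / (m + n)


# Hypothetical values: three river sections, one lake point.
print(city_cwqi(cwqi_river=4.0, cwqi_lake=6.0, m=3, n=1))
```

With this weighting, a city with no lake or reservoir points (n = 0) simply takes the river index.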

11. Calculation of the comprehensive standard-exceeding index for the Yangtze River economic zone

The regional standard-exceeding index of water pollutant concentration is calculated as follows:

R_{water,jk} = max_i(R_{water,ijk}) (2-22)

In the formula, R_{water,jk} is the water pollutant concentration standard-exceeding index of the k-th section of area j, obtained as the maximum of R_{water,ijk} over i; R_{water,j} is the water pollutant concentration standard-exceeding index of area j.

Threshold and parameters:

The evaluation result is divided into three types according to the value of the comprehensive standard-exceeding index of pollutant concentration: when R > 0, the environment is in an overload state; when R is between -0.2 and 0, the environment is in a critical overload state; when R < -0.2, the environment is in a non-overload state. The smaller the pollutant concentration standard-exceeding index, the stronger the capacity of the regional environmental system to support the socio-economic system.
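As a non-limiting illustration, formula (2-22) and the three-way classification can be sketched as follows. Assigning the boundary values R = 0 and R = -0.2 to the critical overload state is an assumption, since the text does not say which interval they belong to; the function names are hypothetical.

```python
def section_exceedance(pollutant_indices):
    """Formula (2-22): the section index is the maximum over its pollutant indices."""
    return max(pollutant_indices)


def classify_exceedance(r: float) -> str:
    """Three-way classification; boundaries -0.2 and 0 are treated as critical (assumption)."""
    if r > 0:
        return "overload"
    if r >= -0.2:
        return "critical overload"
    return "non-overload"


r_section = section_exceedance([-0.3, 0.1, -0.05])  # hypothetical pollutant indices
print(r_section, classify_exceedance(r_section))
```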

12. Assessment of water environment bearing capacity in Yangtze river economic area

The water environment quality evaluation index method is adopted. The water environment quality evaluation index (R) is calculated in three steps. First, the accommodation index of six pollutants is calculated for each state-controlled section, namely COD_Cr, BOD_5, ammonia nitrogen, TP, TN (the TN index is not calculated for rivers) and COD_Mn; the accommodation index is the ratio of the current pollutant value to the Class III surface water standard limit, i.e.

Second, the maximum accommodation index of the pollutants is calculated for each state-controlled section, i.e.

Third, the arithmetic mean of the maximum pollutant accommodation indexes of all state-controlled sections in the area to be evaluated is calculated. The comprehensive calculation formula is as follows:

In the formula, C_ij is the annual average monitored concentration of water pollutant i at state-controlled section j, mg/L; S_i is the Class III surface water standard limit for pollutant i, mg/L; i = 1, 2, …, 6 corresponds to COD_Cr, BOD_5, ammonia nitrogen, TP, TN and COD_Mn; j = 1, 2, …, N, where N is the number of state-controlled sections.

According to the water environment quality evaluation index of the evaluation area, the evaluation result is divided into three types: water environment overload, critical overload and non-overload. Generally, when the water environment quality evaluation index R ≤ 0.7, the water environment is not overloaded; when 0.7 < R ≤ 1.0, the water environment has reached its maximum bearing capacity and is in critical overload; when R > 1.0, the water environment is overloaded.
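As a non-limiting illustration, the three steps and the classification can be sketched as follows. The sketch assumes the comprehensive formula is the mean, over state-controlled sections j, of the maximum over pollutants i of C_ij / S_i, which is how the three steps read. The data layout (one dictionary per section), the function names and the sample concentrations are hypothetical.

```python
from statistics import mean


def evaluation_index(sections, limits):
    """R = mean over sections of the maximum accommodation index C_ij / S_i."""
    section_maxima = [
        max(conc / limits[name] for name, conc in section.items())
        for section in sections
    ]
    return mean(section_maxima)


def classify_bearing(r: float) -> str:
    """Thresholds from the text: R <= 0.7, 0.7 < R <= 1.0, R > 1.0."""
    if r <= 0.7:
        return "non-overload"
    if r <= 1.0:
        return "critical overload"
    return "overload"


# Illustrative limits and annual mean concentrations in mg/L (sample data only).
limits = {"COD_Cr": 20.0, "NH3_N": 1.0}
sections = [
    {"COD_Cr": 10.0, "NH3_N": 0.8},  # maximum accommodation index 0.8
    {"COD_Cr": 24.0, "NH3_N": 0.5},  # maximum accommodation index 1.2
]
r = evaluation_index(sections, limits)
print(r, classify_bearing(r))
```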

13. Ecological environment pressure assessment

(1) Evaluation index score calculation method

The type of each evaluation index is determined from its raw data and the scoring standard, and its score is then calculated with the corresponding formula. All evaluation index scores lie in the range 0-100.

The evaluation indexes are divided into 3 types, and the score is calculated as follows:

1) For an index whose evaluation value is a fixed value, the median of the level is taken directly as the score:

2) For larger-is-better indexes, the score is calculated as follows:

Segmented indexes:

Indexes without an upper limit:

When V_i > 100, V_i is taken as 100.

3) For smaller-is-better indexes, the score is calculated as follows:

Segmented indexes:

Indexes without an upper limit:

When V_i < 0, V_i is taken as 0.

In the above formulas, V_i is the score of evaluation index i; V_il is the lower limit of the level standard of evaluation index i; V_ih is the upper limit of the level standard of evaluation index i; I_i is the raw data of evaluation index i; I_il is the lower limit of the level in which the raw data I_i falls; I_ih is the upper limit of the level in which the raw data I_i falls.
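As a non-limiting illustration, the scoring of a segmented index can be sketched as follows. The scoring formulas are not reproduced in this text, so linear interpolation between the level limits is purely an assumption suggested by the symbols V_il, V_ih, I_il and I_ih; the actual formulas may differ. The function name and sample values are hypothetical.

```python
def segmented_score(raw, raw_low, raw_high, score_low, score_high,
                    larger_is_better=True):
    """Assumed linear interpolation of the score within one level.

    raw_low / raw_high   : limits I_il / I_ih of the level containing the raw value
    score_low / score_high: limits V_il / V_ih of the score for that level
    """
    fraction = (raw - raw_low) / (raw_high - raw_low)
    if larger_is_better:
        score = score_low + fraction * (score_high - score_low)
    else:
        score = score_high - fraction * (score_high - score_low)
    # Clamp to 0-100, as the text requires for out-of-range scores.
    return min(100.0, max(0.0, score))


# Hypothetical level: raw values 10-20 map to scores 60-80.
print(segmented_score(12.5, 10.0, 20.0, 60.0, 80.0))
print(segmented_score(12.5, 10.0, 20.0, 60.0, 80.0, larger_is_better=False))
```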

(2) Method for calculating subentry index

Based on the scores of the single evaluation indexes, the scores of the 6 subentry indexes (population pressure, land utilization, town pollution emission, rural non-point source emission, water resource utilization and basin external pressure) are each calculated by weighted summation.

In the formula, E_j is the score of the j-th subentry index; w_ji is the weight of the i-th evaluation index in the j-th subentry index; V_ji is the score of the i-th evaluation index in the j-th subentry index; n is the number of evaluation indexes in the j-th subentry index.

(3) Comprehensive assessment of the ecological environment pressure special index

Based on the scores of the subentry indexes, the score of the ecological environment pressure special index is calculated by weighted summation. The special index is then graded according to its score, giving a comprehensive evaluation of the pressure that human activities in the basin place on the river ecological environment.

In the formula, C is the score of the special index; w_j is the weight of the j-th subentry index; E_j is the score of the j-th subentry index; n is the number of subentry indexes.
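As a non-limiting illustration, the two-level weighted summation (evaluation indexes to subentry index, then subentry indexes to special index) can be sketched as follows. The sketch assumes the plain forms E_j = Σ w_ji · V_ji and C = Σ w_j · E_j; the weights and scores are hypothetical sample values, and whether the weights must sum to 1 is not stated in the text.

```python
def weighted_sum(scores, weights):
    """Weighted summation of scores; one weight per score."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * v for w, v in zip(weights, scores))


# Hypothetical subentry indexes, each built from its own evaluation index scores.
subentry_scores = [
    weighted_sum([80.0, 60.0], [0.5, 0.5]),             # E_1
    weighted_sum([90.0, 70.0, 50.0], [0.2, 0.3, 0.5]),  # E_2
]
# Special index C from the subentry scores, again by weighted summation.
special_index = weighted_sum(subentry_scores, [0.4, 0.6])
print(subentry_scores, special_index)
```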

The pressure of human activities in the basin on the river ecological environment is divided into five levels: light, normal, heavy and severe.

TABLE-3 river pressure class description of basin human activities

14. Ecosystem health assessment

(1) Evaluation index score calculation

Comprehensive water quality condition B1:

Comprehensive water quality score B1-1:

Comprehensive physical habitat condition B2:

Annual ecological base flow satisfaction rate B2-1:

April to September:

October to March:

Connectivity B2-2:

Natural shoreline ratio B2-3:

Riverbank vegetation coverage B2-4:

Ratio of wetland area to total area B2-5: expert scoring

Comprehensive condition of aquatic organisms B3:

Algal integrity B3-1:

Benthic macroinvertebrate integrity B3-2:

Fish integrity B3-3:

Aquatic plant integrity B3-4:

After each index has been calculated, the type of the evaluation index is determined according to the scoring standard, and its score is calculated with the corresponding formula. All evaluation index scores lie in the range 0-100. The evaluation indexes are divided into 3 types, and the score is calculated as follows:

Indexes whose evaluation value is a fixed value:

Larger-is-better indexes:

Segmented indexes:

Indexes without an upper limit:

Smaller-is-better indexes:

Segmented indexes:

Indexes without an upper limit:

(2) Subentry index score

The score of the population pressure subentry index is calculated from the scores of the single evaluation indexes by weighted summation:

(3) Subentry index grading

TABLE-4 rating Scale

15. Ecological service function assessment

(1) Evaluation index score calculation

Drinking water service function C1:

water quality standard reaching rate of a centralized drinking water source:

water source conservation function C2:

water source conservation index:

C2-1 = mudflat wetland and marsh coverage × 0.5 + forest land coverage × 0.35 + meadow coverage × 0.15 (2-48)

Water environment purification function C3:

Runoff-to-sewage ratio:

In the formula, Q_runoff is the design flow of the river channel, determined from the flow of the driest month in 10 years, and Q_in is the amount of sewage discharged into the river.

The data source is as follows: the related data mainly come from environmental protection statistical data, hydrology yearbook data and water conservancy general survey data.

Biodiversity function C4:

representative rare species habitat C4-1: expert scoring

Invasion of foreign species C4-2: expert scoring

Aquatic product supply function C5:

fishing amount per unit water area:

age of fish C5-2: expert scoring

Protection zone function C6:

natural protected area level C6-1: expert scoring

Indexes whose evaluation value is a fixed value:

Larger-is-better indexes:

Segmented indexes:

Indexes without an upper limit:

Smaller-is-better indexes:

Segmented indexes:

Indexes without an upper limit:

In the above formulas, V_i is the score of evaluation index i; V_il is the lower limit of the level standard of evaluation index i; V_ih is the upper limit of the level standard of evaluation index i; I_i is the raw data of evaluation index i; I_il is the lower limit of the level in which the raw data I_i falls; I_ih is the upper limit of the level in which the raw data I_i falls.

(2) Subentry index score

The score of the population pressure subentry index is calculated from the scores of the single evaluation indexes by weighted summation:

In the formula, E_j is the score of the j-th subentry index; w_ji is the weight of the i-th evaluation index in the j-th subentry index; V_ji is the score of the i-th evaluation index in the j-th subentry index; n is the number of evaluation indexes in the j-th subentry index.

(3) Subentry index grading

TABLE-5 itemized index rankings

16. Ecological risk assessment

(1) Evaluation index score calculation

Risk of outbreak D1:

critical ratio of chemicals D1-1:

production process D1-2:

Calculation method: the enterprise production process is evaluated by a scoring method, based on the enterprise risk unit questionnaire and with reference to the method of the "enterprise emergency environment incident risk evaluation guideline". Each production process item is scored according to Table 5, and the scores are added to determine the unit risk degree. The score is divided into five grades: 0 to 5, 5 to 10, 10 to 20, 20 to 30, and 30 to 40 points.

Exposure population D1-3:

Calculation method: the grading standard for the size of the exposed population follows the provisions of the "enterprise emergency environment incident risk assessment guideline" and the corresponding research papers. With reference to the sensitive environment protection target questionnaire, the population of residential areas and of medical and health, cultural and educational, scientific research, administrative office and similar institutions within 5 km of the enterprise is counted. The result is classified into five grades: 0 to 0.1, 0.1 to 1, 1 to 5, 5 to 10, and >10 (in units of ten thousand people).

Sensitive Environment object D1-4:

Calculation method: values are assigned to the different environmentally sensitive zones (Table 5-3) and then summed to give the index score. The score is divided into five grades: 0 to 5, 5 to 10, 10 to 15, 15 to 20 and >20 points.

Security management and risk prevention D1-5:

Calculation method: the evaluation is based on the enterprise security management and risk prevention questionnaire; each item scores 1 point for "yes" and 0 points for "no", with a full score of 80 points. The result is classified into five grades: 75-80, 70-75, 65-70, 60-65 and 55-60 points.

Risk of land and shipping D1-6:

Calculation method: mobile risk sources are evaluated with reference to the "guidelines for environmental protection of centralized drinking water sources" (Table 5-4). The total score R is taken as the basis for the final index classification. R = f1 + f2 + f3, and R is classified into five grades: 0 to 3, 3 to 7, 7 to 9, 9 to 15 and >15 points.

Cumulative risk indicator D2:

toxic and harmful organic matters D2-1:

heavy metal D2-2:

mine nonmetal D2-3:

(2) Subentry index score

In the formula, E_j is the score of the j-th subentry index; w_ji is the weight of the i-th evaluation index in the j-th subentry index; V_ji is the score of the i-th evaluation index in the j-th subentry index; n is the number of evaluation indexes in the j-th subentry index.

(3) Subentry index grading

TABLE-6 itemized index rankings

17. Ecological safety assessment

A weighted summation method is selected as the basic algorithm of the model.

(1) The scheme layer is calculated by the formula:

In the formula, B_i is the calculation result of the i-th scheme layer; x_ij is the value of the j-th index of the i-th scheme layer; w_j is the weight of the j-th index of the i-th scheme layer.

(2) The Ecological Safety Index (ESI) of the target layer is calculated by the following formula, and the result is a value between 0 and 100:

In the formula, ESI is the ecological safety index; B_i is the value of the i-th scheme layer.

(3) Weight determination

Based on the re-screened evaluation index system, the index weights, i.e. the judgment matrix, need to be re-determined, which can be done by expert consultation. The evaluation index weights of each expert can be obtained with the AHP method. However, the judgments of different experts are often highly inconsistent and are influenced by individual preference, so a multi-criterion group decision model is introduced to obtain a more objective comprehensive judgment matrix.

(4) Ecological safety assessment standard and grade

The Ecological Safety Index (ESI) is taken as the result of a longitudinal comparison between the current state of each river and the standard state, and reflects how far each river deviates from the standard state. An ESI of 100 represents the non-deviation state; the smaller the ESI, the less safe the river.

TABLE-7 evaluation standard for ecological safety of river

Grade	Representative color	Score
Safe	Blue	(80, 100]
Relatively safe	Green	(60, 80]
General	Yellow	(40, 60]
Unsafe	Red	(20, 40]
Very unsafe	Black	[0, 20]
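As a non-limiting illustration, the grading in Table 7 can be expressed as a lookup on the ESI value. The interval boundaries follow the table (left-open, right-closed, with the lowest grade closed at 0); the function name and the English grade labels are illustrative renderings.

```python
def esi_grade(esi: float):
    """Map an ESI value in [0, 100] to (grade, representative color) per Table 7."""
    if not 0 <= esi <= 100:
        raise ValueError("ESI must lie between 0 and 100")
    if esi > 80:
        return ("Safe", "Blue")
    if esi > 60:
        return ("Relatively safe", "Green")
    if esi > 40:
        return ("General", "Yellow")
    if esi > 20:
        return ("Unsafe", "Red")
    return ("Very unsafe", "Black")


for value in (95, 80, 41, 20):
    print(value, esi_grade(value))
```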

Step 204, extracting keywords and abstracts from the text data based on the second data mining module, and structuring the information of the text data.

Optionally, performing keyword extraction on the text data based on the second data mining module includes: based on the TextRank algorithm, dividing the text into several constituent units, building a graph model, ranking the important components of the text with a voting mechanism, and extracting keywords from the text data.

In one possible implementation, TextRank is a graph-based ranking algorithm for text. Its basic idea comes from Google's PageRank algorithm: the text is divided into several constituent units (words or sentences), a graph model is built, and the important components of the text are ranked with a voting mechanism, so that keyword extraction and summarization can be achieved using only the information in a single document. Unlike models such as LDA and HMM, TextRank does not need to be trained on multiple documents in advance, and it is widely used because it is simple and effective.

The general TextRank model can be expressed as a directed weighted graph G = (V, E) consisting of a point set V and an edge set E, where E is a subset of V × V. The weight of the edge between any two points V_i and V_j in the graph is w_ji. For a given point V_i, In(V_i) is the set of points that point to it, and Out(V_i) is the set of points that V_i points to. The score of point V_i is defined as follows:

where d is the damping coefficient, with a value range of 0 to 1; it represents the probability of pointing from a given point in the graph to any other point, and generally takes the value 0.85. When the TextRank algorithm is used to calculate the score of each point in the graph, the points are assigned arbitrary initial values and the scores are computed iteratively until convergence, that is, until the error rate of every point in the graph is less than a given limit, which is generally 0.0001.
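The score formula itself is not reproduced in this text. For reference, the standard TextRank form consistent with the symbols defined above (In, Out, w_ji and d) is given below; the score name WS is an assumption taken from common usage rather than from this document.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
          \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```

Each point V_j thus passes on its score to the points it links to, in proportion to the edge weights, and the damping coefficient d controls how much of a point's score comes from its neighbours.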

The task of keyword extraction is to automatically extract several meaningful words or phrases from a given piece of text. The TextRank algorithm ranks subsequent keywords using the relation between local words (the co-occurrence window) and extracts keywords directly from the text itself. The main steps are as follows:

First, the given text T is divided into complete sentences, i.e.

T = [S_1, S_2, …, S_m] (2-65)

Second, for each sentence S_i ∈ T, word segmentation and part-of-speech tagging are performed, stop words are filtered out, and only words with specified parts of speech, such as nouns, verbs and adjectives, are retained, i.e. S_i = [t_{i,1}, t_{i,2}, …, t_{i,n}], where t_{i,j} ∈ S_i are the candidate keywords retained.

Third, the candidate keyword graph G = (V, E) is constructed, where V is the node set consisting of the candidate keywords generated in the second step. Edges between nodes are then constructed from the co-occurrence relation: an edge exists between two nodes only when the corresponding words co-occur in a window of length K, where K is the window size, i.e. at most K words co-occur.

Fourth, the weight of each node is propagated iteratively according to the score formula until convergence.

Fifth, the node weights are sorted in descending order, and the T most important words are taken as candidate keywords.

Sixth, the T most important words obtained in the fifth step are marked in the original text, and those that form adjacent phrases are combined into multi-word keywords. For example, if the text contains the sentence "Matlab code for marking ambiguity function" and both "Matlab" and "code" are candidate keywords, they are combined into "Matlab code" and added to the keyword sequence.
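As a non-limiting illustration, the keyword extraction steps can be sketched in plain Python as follows. The sketch assumes the input has already been segmented and filtered into lists of candidate words (no word segmentation or part-of-speech tagging is performed here), treats co-occurrence edges as undirected with a weight equal to the co-occurrence count, and uses the standard TextRank update. All function names and sample data are hypothetical.

```python
from collections import defaultdict


def build_cooccurrence_graph(sentences, window=5):
    """Connect words that co-occur within a window of `window` words."""
    graph = defaultdict(lambda: defaultdict(float))
    for words in sentences:
        for i, w1 in enumerate(words):
            for w2 in words[i + 1 : i + window]:
                if w1 != w2:
                    graph[w1][w2] += 1.0
                    graph[w2][w1] += 1.0
    return graph


def textrank(graph, d=0.85, tol=1e-4, max_iter=200):
    """Iterate node scores until every change is below `tol`."""
    scores = {node: 1.0 for node in graph}
    out_weight = {node: sum(nbrs.values()) for node, nbrs in graph.items()}
    for _ in range(max_iter):
        new_scores = {
            node: (1 - d) + d * sum(
                graph[nbr][node] / out_weight[nbr] * scores[nbr]
                for nbr in graph[node]
            )
            for node in graph
        }
        converged = all(abs(new_scores[n] - scores[n]) < tol for n in graph)
        scores = new_scores
        if converged:
            break
    return scores


def extract_keywords(sentences, top_t=3, window=5):
    """Return the `top_t` highest-scoring candidate words."""
    scores = textrank(build_cooccurrence_graph(sentences, window))
    return sorted(scores, key=scores.get, reverse=True)[:top_t]


def merge_adjacent(tokens, keywords):
    """Combine runs of adjacent keywords in the original text into phrases."""
    phrases, run = [], []
    for tok in list(tokens) + [None]:  # None flushes the final run
        if tok in keywords:
            run.append(tok)
        else:
            if len(run) > 1:
                phrases.append(" ".join(run))
            run = []
    return phrases


sentences = [["water", "quality"], ["water", "index"], ["water", "pollution"]]
print(extract_keywords(sentences, top_t=1, window=2))
print(merge_adjacent("Matlab code for marking ambiguity function".split(),
                     {"Matlab", "code", "function"}))
```

In the sample data "water" co-occurs with every other word, so it receives the highest score; a production system would add the segmentation, tagging and stop-word filtering described in the second step.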

Optionally, based on the second data mining module, performing summary extraction on the text data, including: searching in the data according to a Query statement of the text data to obtain a plurality of search results; performing morpheme analysis on the text data to generate a plurality of morphemes; for each search result, calculating a relevance score of each morpheme and each search result; and carrying out weighted summation on the correlation scores of the morphemes relative to the search results to obtain the correlation scores of the Query sentences and the search results, and carrying out abstract extraction on the text data according to the correlation scores of the Query sentences and the search results.

In one possible implementation, the automatic summarization algorithm is one typically used for search relevance scoring. The main idea is to perform morpheme analysis on the Query to generate morphemes q_i; then, for each search result d, to calculate the relevance score of each morpheme q_i with respect to d; and finally to compute the weighted sum of the relevance scores of the q_i with respect to d, which gives the relevance score of the Query and d.

The general formula is as follows:

where Q represents the Query and q_i represents a morpheme obtained by parsing Q; d represents a search result document; W_i represents the weight of morpheme q_i; R(q_i, d) represents the relevance score of morpheme q_i with respect to document d.

Taking IDF as an example of the definition of W_i, the formula is as follows:

where N is the number of all documents in the index and n(q_i) is the number of documents containing q_i.

The relevance score R(q_i, d) of morpheme q_i and document d is calculated as follows:

where k_1, k_2 and b are adjustment factors, usually set empirically, generally k_1 = 2 and b = 0.75; f_i is the frequency of q_i in d, and qf_i is the frequency of q_i in the Query; dl is the length of document d and avgdl is the average length of all documents.

The parameter b adjusts how strongly the document length influences the relevance. The larger b is, the greater the influence of the document length on the relevance score, and vice versa. The longer the relative length of the document, the larger the value of K and the smaller the relevance score. This can be understood as follows: the longer a document is, the greater the chance that it contains q_i; therefore, for the same f_i, the relevance of a long document to q_i should be weaker than that of a short document.

In summary, the formula can be summarized as:

As can be seen from the formula, different search relevance scoring methods can be derived by using different morpheme analysis methods, morpheme weight determination methods and morpheme-document relevance determination methods, which provides great flexibility for algorithm design.
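As a non-limiting illustration, the relevance scoring can be sketched as follows. The formulas are not reproduced in this text, so the sketch assumes the widely used BM25 forms that match the symbols above: IDF(q_i) = log((N - n(q_i) + 0.5) / (n(q_i) + 0.5)), K = k_1 · (1 - b + b · dl / avgdl), and a per-morpheme score of f_i · (k_1 + 1) / (f_i + K), with each morpheme assumed to occur once in the Query so that the qf_i factor drops out. Function names and the sample corpus are hypothetical.

```python
import math


def idf(n_docs: int, n_containing: int) -> float:
    """Assumed IDF weight; negative when a term occurs in more than half the documents."""
    return math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5))


def bm25_score(query_terms, doc_terms, corpus, k1=2.0, b=0.75):
    """Relevance score of one document for a query under the assumed BM25 form."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    big_k = k1 * (1 - b + b * len(doc_terms) / avgdl)  # length normalization K
    score = 0.0
    for q in set(query_terms):
        f = doc_terms.count(q)  # frequency f_i of the morpheme in the document
        if f == 0:
            continue
        n_q = sum(1 for doc in corpus if q in doc)
        score += idf(n_docs, n_q) * f * (k1 + 1) / (f + big_k)
    return score


# Hypothetical corpus of already-segmented documents.
corpus = [
    ["water", "quality", "index"],
    ["river", "basin", "flow"],
    ["lake", "water"],
    ["fish", "habitat", "survey"],
    ["soil", "erosion", "risk"],
]
for doc in corpus:
    print(doc, round(bm25_score(["water", "quality"], doc, corpus), 4))
```

For abstract extraction, sentences can play the role of documents, so that the highest-scoring sentences for a given Query are selected; that mapping is a design choice not fixed by the text.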

Optionally, structuring the information of the text data includes:

structuring the information of the text data; based on a water environment word segmentation dictionary, retrieving geographic position information from the mining data using a word segmentation technique that combines rules and statistics, and locating it on an electronic map; and displaying the mining data by category according to the screening conditions.

In one feasible implementation, the comprehensive application structures the information contained in the text: using a gradually accumulated water environment word segmentation dictionary and a word segmentation technique that combines rules and statistics, geographic position information is retrieved from the content and located on an electronic map, and the content being viewed can be displayed by category according to the screening conditions.

Step 205, acquiring and storing the mining data obtained through the data mining model.

In a possible implementation manner, after the standardized output data mining result is obtained through the data mining model, the mining data can be stored in a storage area of the data mining platform, so that the mining data can be provided for a user in a subsequent user query.

Step 206, feeding back the mining data through the data encapsulation exchange interface when a query request corresponding to the mining data is received.

In a possible implementation manner, the data mining tool encapsulation technology adopts a network address and interface encapsulation technology, and adopts Web Service as a data encapsulation exchange interface. The overall exchange method comprises the following steps: the data supplier provides Web Service interface to issue data, and the data demander calls the Web Service interface to obtain data.

In the embodiment of the invention, original data is acquired from each application through an interface access layer; the original data is preprocessed through a data acquisition ETL platform to obtain input data meeting the model standard, and the input data is input into a pre-trained data mining model; the data mining model is divided into a first data mining module for service evaluation and a second data mining module for text analysis; data model indexes are calculated based on the first data mining module; the data model indexes are divided into 4 types, namely a section water quality evaluation type, a water quality index calculation type, a water environment bearing capacity evaluation type and a water ecological safety evaluation type; based on the second data mining module, keyword extraction and abstract extraction are performed on the text data, and the information of the text data is structured; the mining data obtained through the data mining model is acquired and stored; and when a query request corresponding to the mining data is received, the mining data is fed back through the data encapsulation exchange interface. In this way, with water environment management as the goal, big data on hydrology, water resources, water environment, meteorology, social economy and the like is taken as the object of analysis; the mining requirements for watershed water environment data are summarized and analyzed from the perspectives of evaluation and decision-making and of service management; the data mining themes and targets are determined in combination with the temporal and spatial characteristics of the water environment management service; and a data mining service model is constructed for application scenarios such as current state analysis, cause analysis, traceability analysis, potential evaluation, anomaly identification and trend early warning, thereby realizing data mining.

FIG. 3 is a block diagram illustrating a watershed water environment-based big data mining device according to an exemplary embodiment. Referring to fig. 3, the apparatus 300 includes an obtaining module 310, a preprocessing module 320, a calculating module 330, an extracting module 340, a storing module 350, and a querying module 360; wherein:

an obtaining module 310, configured to obtain original data from each application through an interface access layer;

the preprocessing module 320 is used for preprocessing the original data through a data acquisition ETL platform to obtain input data meeting the model standard, and inputting the input data into a pre-trained data mining model; the data mining model is divided into a first data mining module facing service evaluation and a second data mining module facing text analysis;

a calculation module 330, configured to calculate a data model index based on the first data mining module; the data model indexes are divided into 4 types, namely a section water quality evaluation type, a water quality index calculation type, a water environment bearing capacity evaluation type and a water ecological safety evaluation type;

the extracting module 340 is configured to perform keyword extraction and abstract extraction on the text data based on the second data mining module, and perform structured processing on the information of the text data;

a storage module 350, configured to obtain and store mining data obtained through the data mining model;

and the query module 360 is configured to feed back the mining data through the data encapsulation exchange interface when receiving a query request corresponding to the mining data.

Fig. 4 is a schematic structural diagram of a big data mining platform 400 according to an embodiment of the present invention, where the big data mining platform 400 may generate relatively large differences due to different configurations or performances, and may include one or more processors (CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processors 401 to implement the steps of the big data mining method based on the watershed water environment.

In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal, is also provided to perform the above-described watershed water environment-based big data mining method. For example, the computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A big data mining method based on watershed water environment is characterized by comprising the following steps:

acquiring and storing mining data obtained through the data mining model;

2. The method of claim 1, wherein said pre-processing said raw data by a data acquisition ETL platform comprises:

3. The method according to claim 1, wherein the section water quality evaluation type comprises river water quality evaluation, lake and reservoir eutrophication evaluation, surface water drinking water quality evaluation, groundwater drinking water quality evaluation, coastal sea area water quality evaluation and regional water quality evaluation.

4. The method of claim 1, wherein the water quality index calculation type comprises water quality index calculation, the comprehensive water quality pollution index, urban water quality index calculation and calculation of the comprehensive standard-exceeding index for the Yangtze River economic zone.

5. The method according to claim 1, wherein the water environment bearing capacity evaluation type comprises Yangtze River economic zone water environment bearing capacity assessment, ecological environment pressure assessment, ecosystem health assessment, ecological service function assessment and ecological risk assessment.

6. The method of claim 1, wherein the type of water ecological safety assessment comprises a water ecological safety assessment.

7. The method of claim 1, wherein extracting keywords from text data based on the second data mining module comprises:

8. The method of claim 1, wherein abstracting text data based on the second data mining module comprises:

9. The method according to claim 1, wherein the structuring the information of the text data comprises:

10. A big data mining device based on watershed water environment is characterized by comprising:

the preprocessing module is used for preprocessing the original data through a data acquisition ETL platform to obtain input data meeting the model standard and inputting the input data into a pre-trained data mining model; the data mining model is divided into a first data mining module facing service evaluation and a second data mining module facing text analysis;