CN117473464B

CN117473464B - Enterprise environment treatment data processing system based on big data analysis

Info

Publication number: CN117473464B
Application number: CN202311824362.8A
Authority: CN
Inventors: 余姝洁; 梁天池; 刘思焜; 罗旌生; 冯子杰; 李棉; 梁智超; 冯承婷; 萧素婷
Original assignee: Zhongshan Environmental Protection Technology Center
Current assignee: Zhongshan Environmental Protection Technology Center
Priority date: 2023-12-28
Filing date: 2023-12-28
Publication date: 2024-04-02
Anticipated expiration: 2043-12-28
Also published as: CN117473464A

Abstract

The invention relates to the field of data processing, and in particular to an enterprise environmental management data processing system based on big data analysis. The system acquires a plurality of sampled data points from enterprise environment data; takes the enterprise environment data in the segment containing each sampled data point as its initial neighborhood data; obtains the degree of dispersion of that initial neighborhood data and trims or expands it accordingly to obtain the neighborhood data of each sampled data point; obtains the consumption characteristic value and emission characteristic value corresponding to each sampled data point and the correlation coefficient of each pair of canonical variables; builds a multidimensional feature for each sampled data point and constructs an isolation forest from these features to obtain an initial anomaly score; corrects the initial anomaly score to obtain the anomaly score of each sampled data point; and obtains a regression model according to the anomaly scores. The invention improves the accuracy of the regression model.

Description

Enterprise environment treatment data processing system based on big data analysis

Technical Field

The invention relates to the field of data processing, in particular to an enterprise environment management data processing system based on big data analysis.

Background

Technologies such as big data analysis, artificial intelligence and machine learning can help enterprises carry out digital transformation by integrating multi-source environmental data into a centralized system, which improves the visualization, analysis and management of the data. Real-time monitoring of environmental indicators such as emissions, waste generation and energy use gives enterprises a clearer view of their environmental management. To mine the correlations within the data, the energy use and emissions of the various types of equipment are analyzed by regression, so that high-pollution conditions that may arise later in production can be predicted in time.

When regression analysis is performed with the traditional partial least squares (PLS) method, every sampled data point carries the same weight coefficient. In actual operation, however, abnormal operation of a piece of equipment can cause the relationship between consumption and emission at a given moment to deviate substantially, that is, the enterprise environment data become abnormal. Using such sampled data in the regression analysis then inevitably biases the regression coefficients.

Disclosure of Invention

In order to solve the above problems, the present invention provides an enterprise environmental remediation data processing system based on big data analysis, the system comprising:

the enterprise environment data acquisition module is used for acquiring enterprise environment data, wherein the enterprise environment data comprises consumption data and emission data;

the neighborhood data acquisition module of the sampling data is used for dividing the enterprise environment data into a plurality of segments, and recording the enterprise environment data of the central position of each segment as the sampling data to obtain a plurality of sampling data; taking enterprise environment data in a section of each sampling data as initial neighborhood data of each sampling data; acquiring the discrete degree of the initial neighborhood data of each sampling data according to the sampling data and consumption data and emission data corresponding to the initial neighborhood data of the sampling data; deleting or expanding the initial neighborhood data of each sampling data according to the discrete degree of the initial neighborhood data of each sampling data to obtain the neighborhood data of each sampling data;

the anomaly score acquisition module of the sampled data is used for acquiring the consumption characteristic value corresponding to each sampled data point according to the consumption data of the sampled data point and of its neighborhood data; acquiring the emission characteristic value corresponding to each sampled data point according to the emission data of the sampled data point and of its neighborhood data; acquiring the correlation coefficient of each pair of canonical variables of each sampled data point according to its emission characteristic value and consumption characteristic value; constructing the multidimensional feature of each sampled data point from its number of neighborhood data, its consumption characteristic value, its emission characteristic value and the correlation coefficient of each pair of canonical variables; constructing an isolation forest according to the multidimensional features and acquiring an initial anomaly score of each sampled data point; and correcting the initial anomaly score of each sampled data point according to the degree of fluctuation of the consumption data and of the emission data of the sampled data point and its neighborhood data, to obtain the anomaly score of each sampled data point;

and the regression model acquisition module is used for acquiring an anomaly score matrix according to the anomaly score of each sampled data point, correcting the objective function with it, and acquiring a regression model.

Preferably, the step of dividing the enterprise environment data into a plurality of segments includes the steps of:

a data count $N$ is preset, and the time-ordered enterprise environment data are divided sequentially into a plurality of segments, each containing $N$ enterprise environment data.

Preferably, the obtaining the discrete degree of the initial neighborhood data of each sampling data according to the sampling data and the consumption data and the discharge data corresponding to the initial neighborhood data of the sampling data includes the steps of:

taking each type of consumption data and each type of emission data as one dimension of a multidimensional space, converting the $i$-th sampled data point into a coordinate point in that space, denoted the multidimensional coordinate point of the $i$-th sampled data point; and converting the $j$-th initial neighborhood data of the $i$-th sampled data point into a coordinate point in that space, denoted the multidimensional coordinate point of the $j$-th initial neighborhood data of the $i$-th sampled data point;

in the formula (not reproduced in this text), $D_i$ denotes the degree of dispersion of the initial neighborhood data of the $i$-th sampled data point; $n_i$ denotes the number of initial neighborhood data of the $i$-th sampled data point; and $d_{i,j}$ denotes the Euclidean distance between the multidimensional coordinate point of the $i$-th sampled data point and that of its $j$-th initial neighborhood data.

Preferably, the step of deleting or expanding the initial neighborhood data of each sampled data according to the degree of discretization of the initial neighborhood data of each sampled data to obtain the neighborhood data of each sampled data includes the steps of:

acquiring the mean of the degrees of dispersion of the initial neighborhood data of all sampled data points, denoted $\bar{D}$, and presetting a step size $s$. When the degree of dispersion of the initial neighborhood data of the $i$-th sampled data point is greater than or equal to $\bar{D}$, both ends of the segment containing the $i$-th sampled data point are trimmed by step $s$ to obtain the first trimmed segment of the $i$-th sampled data point; the enterprise environment data in that segment are denoted the first trimmed neighborhood data of the $i$-th sampled data point, and their degree of dispersion is obtained. When that degree of dispersion is still greater than or equal to $\bar{D}$, both ends of the first trimmed segment are trimmed by step $s$ to obtain the second trimmed segment; the enterprise environment data in it are denoted the second trimmed neighborhood data of the $i$-th sampled data point, and their degree of dispersion is again compared with $\bar{D}$. This is repeated until the degree of dispersion of the $t$-th trimmed neighborhood data of the $i$-th sampled data point is less than $\bar{D}$, whereupon the $t$-th trimmed neighborhood data are taken as the neighborhood data of the $i$-th sampled data point;

when the degree of dispersion of the initial neighborhood data of the $i$-th sampled data point is less than the mean degree of dispersion $\bar{D}$, both ends of the segment containing the $i$-th sampled data point are expanded by the preset step size $s$ to obtain the first expanded segment of the $i$-th sampled data point; the enterprise environment data in that segment are denoted the first expanded neighborhood data of the $i$-th sampled data point, and their degree of dispersion is obtained by the same method as for the initial neighborhood data. When that degree of dispersion is still less than $\bar{D}$, both ends of the first expanded segment are expanded by step $s$ to obtain the second expanded segment, and the enterprise environment data in it are denoted the second expanded neighborhood data of the $i$-th sampled data point. This is repeated until the degree of dispersion of the $t$-th expanded neighborhood data of the $i$-th sampled data point is greater than or equal to $\bar{D}$, whereupon the $t$-th expanded neighborhood data are taken as the neighborhood data of the $i$-th sampled data point.

Preferably, the obtaining the consumption characteristic value corresponding to each sampling data according to the sampling data and the consumption data corresponding to the neighborhood data of the sampling data includes the steps of:

in the formula (not reproduced in this text), $C_i$ denotes the consumption characteristic value corresponding to the $i$-th sampled data point; $a$ denotes the number of types of consumption data corresponding to a sampled data point; $\mu^{c}_{i,k}$ denotes the mean of the $k$-th type of consumption data over the $i$-th sampled data point and all of its neighborhood data; and $\sigma^{c}_{i,k}$ denotes the standard deviation of the $k$-th type of consumption data over the $i$-th sampled data point and all of its neighborhood data.

Preferably, the acquiring the emission characteristic value corresponding to each sampling data according to the sampling data and the emission data corresponding to the neighborhood data of the sampling data includes the steps of:

in the formula (not reproduced in this text), $E_i$ denotes the emission characteristic value corresponding to the $i$-th sampled data point; $b$ denotes the number of types of emission data corresponding to a sampled data point; $\mu^{e}_{i,k}$ denotes the mean of the $k$-th type of emission data over the $i$-th sampled data point and all of its neighborhood data; and $\sigma^{e}_{i,k}$ denotes the standard deviation of the $k$-th type of emission data over the $i$-th sampled data point and all of its neighborhood data.

Preferably, the obtaining the correlation coefficient of each pair of typical correlation variables of each sampled data according to the emission characteristic value and the consumption characteristic value corresponding to each sampled data includes the steps of:

applying canonical correlation analysis to the emission characteristic value and the consumption characteristic value corresponding to each sampled data point to obtain the first $r$ pairs of canonical variables of each sampled data point, and obtaining the correlation coefficient of each pair of canonical variables of each sampled data point.

Preferably, the constructing an isolated forest according to the multidimensional feature of each sampled data, and obtaining the initial anomaly score of each sampled data includes the steps of:

presetting the maximum tree depth as $h$, denoting the number of sampled data points as $n$, setting the number of training samples per tree to $\psi$ and the number of trees to $T$ (the preset expressions are not reproduced in this text); constructing an isolation forest according to the multidimensional feature of each sampled data point, and acquiring the initial anomaly score of each sampled data point from the constructed isolation forest; $\lceil\cdot\rceil$ denotes the round-up (ceiling) operator and $\max(\cdot)$ denotes the maximum function.
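As a rough illustration of how an isolation forest turns multidimensional features into anomaly scores, the sketch below implements the textbook algorithm in plain Python. The defaults (100 trees, at most 256 training samples per tree, depth limit of the ceiling of log2 of the sample size) are the usual isolation-forest conventions and are assumptions here, not the patent's preset expressions; the function names and the toy feature vectors are hypothetical.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, x, depth=0):
    """Depth at which x lands, plus the expected extra depth of an unsplit leaf."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)

def anomaly_scores(features, n_trees=100, sample_size=256, seed=0):
    """Initial anomaly score in (0, 1] per feature vector; higher means more anomalous."""
    rng = random.Random(seed)
    psi = min(sample_size, len(features))          # assumed subsample size
    max_depth = math.ceil(math.log2(max(psi, 2)))  # assumed depth limit
    trees = [build_tree(rng.sample(features, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(psi) or 1.0
    return [2.0 ** (-(sum(path_length(t, x) for t in trees) / n_trees) / norm)
            for x in features]

# Hypothetical multidimensional features; the last one is far from the rest.
features = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.05), (1.05, 1.0),
            (0.95, 0.95), (1.0, 1.1), (0.98, 1.02), (9.0, 9.0)]
scores = anomaly_scores(features)
```

Points that are isolated after few random splits have short average paths and therefore scores closer to 1.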

Preferably, the step of correcting the initial anomaly score of each sampled data according to the fluctuation of the consumption data and the fluctuation degree of the discharge data corresponding to the sampled data and the neighborhood data of the sampled data to obtain the anomaly score of each sampled data includes the steps of:

in the formula (not reproduced in this text), $S_i$ denotes the anomaly score of the $i$-th sampled data point; $b$ denotes the number of types of emission data corresponding to a sampled data point; $\sigma^{\Delta e}_{i,k}$ denotes the standard deviation of the first-order differences of the $k$-th type of emission data over the $i$-th sampled data point and its neighborhood data; $a$ denotes the number of types of consumption data corresponding to a sampled data point; $\sigma^{\Delta c}_{i,k}$ denotes the standard deviation of the first-order differences of the $k$-th type of consumption data over the $i$-th sampled data point and its neighborhood data; $S^{0}_i$ denotes the initial anomaly score of the $i$-th sampled data point; and $\mathrm{norm}(\cdot)$ denotes a normalization function.
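The fluctuation terms named above are standard deviations of first-order differences. The following minimal sketch computes that quantity for one data type over a time-ordered series (a sampled data point together with its neighborhood data). The function name and the sample values are hypothetical, and the use of the population standard deviation is an assumption; how these values are combined with the initial anomaly score is governed by the correction formula and is not attempted here.

```python
from statistics import pstdev

def diff_std(series):
    """Standard deviation of the first-order differences of one data type."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    # pstdev needs at least one value; a series with no differences does not fluctuate.
    return pstdev(diffs) if diffs else 0.0

steady = diff_std([1.0, 2.0, 3.0, 4.0])     # constant increments
jittery = diff_std([0.0, 2.0, 0.0, 2.0])    # alternating increments
```

A steadily rising series yields zero because its increments are constant, while an oscillating series yields a large value.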

Preferably, the step of obtaining an anomaly score matrix according to the anomaly score of each sampled data to correct the objective function and obtain a regression model includes the steps of:

obtaining the final number of cycles of the objective function of the partial least squares method by K-fold cross-validation, denoted $R$;

acquiring the explanatory variable matrix of the first cycle: the explanatory variable matrix has size $n \times a$, where $a$ denotes the number of types of consumption data corresponding to a sampled data point and $n$ denotes the number of sampled data points; in order of the acquisition time of the sampled data, the consumption data of each sampled data point are taken as one row of the explanatory variable matrix, giving the explanatory variable matrix of the first cycle; acquiring the explained variable matrix of the first cycle: the explained variable matrix has size $n \times b$, where $b$ denotes the number of types of emission data corresponding to a sampled data point and $n$ denotes the number of sampled data points; in order of the acquisition time of the sampled data, the emission data of each sampled data point are taken as one row of the explained variable matrix, giving the explained variable matrix of the first cycle;

in the formula (not reproduced in this text), $S_1$ denotes the anomaly score of the 1st sampled data point; $S_2$ denotes the anomaly score of the 2nd sampled data point; $S_n$ denotes the anomaly score of the $n$-th sampled data point; and $W$ denotes the anomaly score matrix;

the eigenvector corresponding to the largest eigenvalue of a first matrix is taken as the weight coefficient of the explanatory variables in the first cycle, and the eigenvector corresponding to the largest eigenvalue of a second matrix is taken as the weight coefficient of the explained variables in the first cycle (the two matrix expressions are not reproduced in this text);

in the formula (not reproduced in this text), $W$ denotes the anomaly score matrix; $X_1$ denotes the explanatory variable matrix of the first cycle; $Y_1$ denotes the explained variable matrix of the first cycle; $c_1$ denotes the weight coefficient of the explained variables in the first cycle; $w_1$ denotes the weight coefficient of the explanatory variables in the first cycle; and $J_1$ denotes the objective function of the first cycle of the partial least squares method;

in the formulas (not reproduced in this text), $X_1$ denotes the explanatory variable matrix of the first cycle; $Y_1$ denotes the explained variable matrix of the first cycle; $w_1$ denotes the weight coefficient of the explanatory variables in the first cycle; $X_2$ denotes the explanatory variable matrix of the second cycle; $Y_2$ denotes the explained variable matrix of the second cycle; and $X_2$ and $Y_2$ are each obtained from the first-cycle quantities by their respective formulas;

the objective function of the second cycle is obtained by the same method as that of the first cycle, and so on, until the objective function of the final, $R$-th cycle has been obtained;

the regression model of the partial least squares method is obtained according to the objective function of the $R$-th cycle.
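For orientation, one cycle of standard, unweighted partial least squares can be written as below. This is the textbook form and not the corrected objective of the invention: the symbols $X_1$, $Y_1$ (explanatory and explained variable matrices), $w_1$, $c_1$ (weight vectors) and $t_1$, $p_1$, $q_1$ (score and loading vectors) are notation assumed here, and the way the anomaly score matrix enters the objective is defined by the invention's own formulas.

```latex
\begin{aligned}
(w_1, c_1) &= \arg\max_{\|w\| = \|c\| = 1} \; w^{\top} X_1^{\top} Y_1 \, c, \\
X_1^{\top} Y_1 Y_1^{\top} X_1 \, w_1 &= \lambda_{\max} \, w_1, \qquad
Y_1^{\top} X_1 X_1^{\top} Y_1 \, c_1 = \lambda_{\max} \, c_1, \\
t_1 &= X_1 w_1, \qquad
p_1 = \frac{X_1^{\top} t_1}{t_1^{\top} t_1}, \qquad
q_1 = \frac{Y_1^{\top} t_1}{t_1^{\top} t_1}, \\
X_2 &= X_1 - t_1 p_1^{\top}, \qquad
Y_2 = Y_1 - t_1 q_1^{\top}.
\end{aligned}
```

The second line is the eigenvector characterization of the maximizers, and the last line is the usual deflation that produces the matrices for the next cycle.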

The invention has the following beneficial effects. The enterprise environment data are divided into segments, the enterprise environment data at the central position of each segment are recorded as sampled data, and only the sampled data are analyzed to construct the regression model, which prevents the model from becoming overly complex because of the large data volume. The enterprise environment data in the segment containing each sampled data point are then taken as its initial neighborhood data. By obtaining the degree of dispersion of the initial neighborhood data of each sampled data point and trimming or expanding that data accordingly, the neighborhood data of each sampled data point are obtained with their degrees of dispersion at the same level, so that the features of each sampled data point can subsequently be characterized more accurately from its neighborhood data. Next, the consumption characteristic value and emission characteristic value corresponding to each sampled data point are acquired, and the correlation coefficient of each pair of canonical variables is obtained from them; these form the multidimensional feature of each sampled data point, from which an isolation forest is constructed and an initial anomaly score is acquired. The initial anomaly score is then corrected according to the degree of fluctuation of the consumption data and emission data of the sampled data point and its neighborhood data, which makes the resulting anomaly score more accurate. Finally, an anomaly score matrix is obtained from the anomaly scores and used to correct the objective function, yielding a regression model with improved accuracy.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a system block diagram of an enterprise environmental remediation data processing system based on big data analysis, in accordance with one embodiment of the present invention.

Detailed Description

In order to further describe the technical means and effects adopted by the present invention to achieve the preset purpose, the following detailed description refers to the specific implementation, structure, characteristics and effects of the enterprise environmental management data processing system based on big data analysis according to the present invention with reference to the accompanying drawings and the preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The following specifically describes a specific scheme of the enterprise environmental management data processing system based on big data analysis provided by the invention with reference to the accompanying drawings.

Referring now to FIG. 1, a system for processing environmental remediation data for an enterprise based on big data analysis is shown, the system comprising the following modules:

the enterprise environment data acquisition module 101 acquires enterprise environment data.

It should be noted that this embodiment uses the partial least squares (PLS) method to perform regression analysis between the consumption and the emission corresponding to the enterprise environment data, in order to determine the relationship between them and thereby predict the environmental damage caused by each piece of equipment during production. The enterprise environment data therefore need to be collected before the analysis begins.

In the embodiment of the invention, monitoring devices are installed in the factory to collect its power consumption, water resource consumption, energy use efficiency, equipment output, solid waste discharge, liquid waste discharge, gas discharge and noise level; each collection forms one enterprise environment data, and the collection frequency is once every 10 minutes. The consumption data corresponding to the enterprise environment data are the power consumption, water resource consumption, energy use efficiency and equipment output of the factory; the emission data corresponding to the enterprise environment data are the solid waste discharge, liquid waste discharge, gas discharge and noise level of the factory.
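One way to hold a single enterprise environment data in code is a record with the four consumption fields followed by the four emission fields listed above. The class name, field names and numeric values below are hypothetical and serve only to illustrate the split into consumption and emission data.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class EnvRecord:
    """One enterprise environment data (one collection every 10 minutes)."""
    # consumption data
    power_consumption: float
    water_consumption: float
    energy_efficiency: float
    equipment_output: float
    # emission data
    solid_waste: float
    liquid_waste: float
    gas_emission: float
    noise_level: float

    def consumption(self) -> tuple:
        """The four consumption values, in field order."""
        return astuple(self)[:4]

    def emission(self) -> tuple:
        """The four emission values, in field order."""
        return astuple(self)[4:]

# Made-up reading for illustration only.
record = EnvRecord(120.0, 35.5, 0.82, 410.0, 2.1, 7.4, 15.3, 68.0)
```

Concatenating `record.consumption()` and `record.emission()` gives the eight-dimensional coordinate point used later when distances between data are computed.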

Thus, the enterprise environment data is collected.

The neighborhood data acquisition module 102 of the sampling data samples the collected enterprise environment data to acquire a plurality of sampling data, presets initial neighborhood data of each sampling data, acquires the discrete degree of the initial neighborhood data of each sampling data, and adjusts the initial neighborhood data of each sampling data according to the discrete degree of the initial neighborhood data of each sampling data to acquire the neighborhood data of each sampling data.

It should be noted that, since the amount of the enterprise environment data collected in the module 101 is large, if regression analysis is performed according to all the collected enterprise environment data, the model becomes too complex, and the risk of overfitting increases, so that subsampling is performed in the collected enterprise environment data as a training sample to perform regression analysis.

In the embodiment of the invention, a data count $N$ is preset, and the time-ordered enterprise environment data are divided sequentially into segments, each containing $N$ enterprise environment data. It should be noted that when the number of enterprise environment data in the last segment does not reach $N$, it is not treated as a segment. The enterprise environment data at the central position of each segment are taken as sampled data, giving a plurality of sampled data points. This embodiment presets a specific value of $N$ (not reproduced in this text); in other embodiments, the practitioner may set the value of $N$ according to the specific implementation.
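The segmentation and center sampling described above can be sketched as follows. The function name is hypothetical, the segment length of 5 is an arbitrary illustrative value rather than the embodiment's preset, and for an even segment length the sketch picks the upper of the two middle positions, which is an assumption.

```python
def split_and_sample(data, segment_len):
    """Split time-ordered data into full segments and pick each segment's center index."""
    segments = []
    # Stepping by segment_len and stopping early drops an incomplete trailing segment.
    for start in range(0, len(data) - segment_len + 1, segment_len):
        segments.append((start, start + segment_len))  # half-open range [start, end)
    centers = [start + segment_len // 2 for start, _ in segments]
    return segments, centers

# 23 time-ordered readings, hypothetical segment length of 5.
segments, centers = split_and_sample(list(range(23)), segment_len=5)
```

With 23 readings and a length of 5, the last three readings do not form a full segment and are left out, matching the note about the final segment.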

It should be noted that when regression analysis is performed with the traditional partial least squares (PLS) method, every sampled data point carries the same weight coefficient. In actual operation, however, abnormal operation of a piece of equipment can cause the relationship between the consumption and the emission corresponding to the sampled data at a given moment to deviate substantially, that is, the enterprise environment data become abnormal, and using such sampled data in the regression analysis inevitably biases the regression coefficients. The invention therefore corrects the weight coefficient of each sampled data point according to how abnormal it is. The abnormality of each sampled data point is analyzed together with its neighborhood data, and the degrees of dispersion of the neighborhood data of all sampled data points are brought to the same level, which facilitates the subsequent acquisition of the weight coefficient of each sampled data point.

In the embodiment of the invention, the initial neighborhood data of each sampled data point are acquired: the other enterprise environment data in the segment containing the $i$-th sampled data point are taken as the initial neighborhood data of the $i$-th sampled data point, giving the initial neighborhood data of each sampled data point.

Acquiring the discrete degree of the initial neighborhood data of each sampling data:

Each type of consumption data and each type of emission data is taken as one dimension of a multidimensional space. The $i$-th sampled data point is converted into a coordinate point in that space, denoted the multidimensional coordinate point of the $i$-th sampled data point; the $j$-th initial neighborhood data of the $i$-th sampled data point is converted into a coordinate point in that space, denoted the multidimensional coordinate point of the $j$-th initial neighborhood data of the $i$-th sampled data point;

in the formula (not reproduced in this text), $D_i$ denotes the degree of dispersion of the initial neighborhood data of the $i$-th sampled data point; $n_i$ denotes the number of initial neighborhood data of the $i$-th sampled data point; $d_{i,j}$ denotes the Euclidean distance between the multidimensional coordinate point of the $i$-th sampled data point and that of its $j$-th initial neighborhood data; and $d_{i,j}^{2}$ denotes the square of that Euclidean distance. The larger $d_{i,j}^{2}$ is, the more dispersed the $j$-th initial neighborhood data is relative to the $i$-th sampled data point, that is, the two are not at the same level. The larger the value of $D_i$, the less the $i$-th sampled data point and its initial neighborhood data are at the same level, that is, the greater the degree of dispersion of the initial neighborhood data of the $i$-th sampled data point.
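The definitions above involve the number of initial neighborhood data and the squared Euclidean distance from the sampled data point to each of them. A form consistent with those definitions, assumed here rather than taken from the formula itself, is the mean squared distance; the sketch below computes it. The function name and the example points are hypothetical.

```python
import math

def dispersion(sample, neighbors):
    """Assumed form: mean squared Euclidean distance from the sample to its neighbors."""
    if not neighbors:
        return 0.0
    return sum(math.dist(sample, p) ** 2 for p in neighbors) / len(neighbors)

# Two-dimensional toy points; real points would have one dimension per data type.
value = dispersion((0.0, 0.0), [(3.0, 4.0), (0.0, 1.0)])
```

Here the two squared distances are 25 and 1, so the result is their mean, 13.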

The neighborhood data of the $i$-th sampled data point are acquired as follows:

the mean of the degrees of dispersion of the initial neighborhood data of all sampled data points is acquired and denoted $\bar{D}$, and a step size $s$ is preset. When the degree of dispersion of the initial neighborhood data of the $i$-th sampled data point is greater than or equal to $\bar{D}$, both ends of the segment containing the $i$-th sampled data point are trimmed by step $s$ to obtain the first trimmed segment of the $i$-th sampled data point, that is, enterprise environment data are removed from both ends; the enterprise environment data in the first trimmed segment are denoted the first trimmed neighborhood data of the $i$-th sampled data point, and their degree of dispersion is obtained. When that degree of dispersion is still greater than or equal to $\bar{D}$, both ends of the first trimmed segment are trimmed by step $s$ to obtain the second trimmed segment; the enterprise environment data in it are denoted the second trimmed neighborhood data of the $i$-th sampled data point, and their degree of dispersion is again compared with $\bar{D}$. This is repeated until the degree of dispersion of the $t$-th trimmed neighborhood data of the $i$-th sampled data point is less than $\bar{D}$, whereupon the $t$-th trimmed neighborhood data are taken as the neighborhood data of the $i$-th sampled data point. Note that the enterprise environment data in the segment containing the $i$-th sampled data point, that is, its initial neighborhood data, serve as the starting neighborhood data for the successive trimmings;

when the degree of dispersion of the initial neighborhood data of the $i$-th sampled data point is less than the mean degree of dispersion $\bar{D}$, both ends of the segment containing the $i$-th sampled data point are expanded by the preset step size $s$ to obtain the first expanded segment of the $i$-th sampled data point, that is, each end is extended with enterprise environment data from the segments adjacent to it. If no other enterprise environment data exist beyond one end, that is, the segment lies at the boundary of the enterprise environment data, that end is not expanded further. The enterprise environment data in the first expanded segment are denoted the first expanded neighborhood data of the $i$-th sampled data point, and their degree of dispersion is obtained by the same method as for the initial neighborhood data. When that degree of dispersion is still less than $\bar{D}$, both ends of the first expanded segment are expanded by step $s$ to obtain the second expanded segment, and the enterprise environment data in it are denoted the second expanded neighborhood data of the $i$-th sampled data point. This is repeated until the degree of dispersion of the $t$-th expanded neighborhood data of the $i$-th sampled data point is greater than or equal to $\bar{D}$, whereupon the $t$-th expanded neighborhood data are taken as the neighborhood data of the $i$-th sampled data point. Note that the enterprise environment data in the segment containing the $i$-th sampled data point, that is, its initial neighborhood data, serve as the starting neighborhood data for the successive expansions.

In this embodiment of the invention the step length takes a preset value; in other embodiments, the practitioner may set the step length to another value as required.
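The expansion procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the degree of dispersion is assumed to be the mean Euclidean distance between the sampled data and its neighborhood data, the step length is assumed to count records added at each end, and the names `dispersion` and `expand_neighborhood` are hypothetical.

```python
import math


def dispersion(point, neighbors):
    """Mean Euclidean distance from `point` to its neighbors (assumed definition)."""
    if not neighbors:
        return 0.0
    return sum(math.dist(point, q) for q in neighbors) / len(neighbors)


def expand_neighborhood(data, idx, lo, hi, threshold, step):
    """Widen the segment [lo, hi) around data[idx] until dispersion >= threshold.

    data      : list of coordinate tuples in acquisition order
    idx       : position of the sampled data inside the segment
    lo, hi    : bounds of the initial segment (hi is exclusive)
    threshold : mean degree of dispersion over all sampled data
    step      : preset step length (assumed: records added per end)
    """
    while True:
        # Assumption: the sampled data itself is not counted as a neighbor.
        neighbors = [data[k] for k in range(lo, hi) if k != idx]
        if dispersion(data[idx], neighbors) >= threshold:
            break
        new_lo = max(0, lo - step)           # an end at the boundary stays put
        new_hi = min(len(data), hi + step)
        if (new_lo, new_hi) == (lo, hi):     # nothing left to add on either side
            break
        lo, hi = new_lo, new_hi
    return lo, hi
```

For ten equally spaced one-dimensional records, a segment covering positions 4 to 6 around position 5 grows twice before the mean distance reaches a threshold of 2.0. The pruning case is symmetric: the bounds move inward while the degree of dispersion stays at or above the threshold.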

At this point, the acquired enterprise environment data have been sampled to obtain a plurality of sampled data, initial neighborhood data have been preset for each sampled data, the degree of dispersion of the initial neighborhood data of each sampled data has been obtained, and the initial neighborhood data of each sampled data have been adjusted according to that degree of dispersion to obtain the neighborhood data of each sampled data.

The anomaly score obtaining module 103 obtains the consumption characteristic value and the emission characteristic value corresponding to each sampled data according to the neighborhood data of each sampled data; obtains the correlation coefficient of each pair of canonical variables of each sampled data according to these characteristic values; obtains the multidimensional feature of each sampled data; constructs an isolation forest from the multidimensional features to obtain the initial anomaly score of each sampled data; and corrects the initial anomaly score to obtain the anomaly score of each sampled data.

It should be noted that the purpose of the present invention is to correct the weight coefficient of each sampled data according to how anomalous that sampled data is. The isolation forest is a well-known algorithm for detecting outliers and works by constructing isolation binary trees. The multidimensional feature of each sampled data therefore has to be obtained first, so that an isolation forest can subsequently be constructed from these features and used to analyse the anomaly of each sampled data. The steps above yield the neighborhood data of each sampled data, chosen so that the degree of dispersion between each sampled data and its neighborhood data is at a comparable level. Because the separation between normal points and anomalous points becomes blurred in a high-dimensional space, an isolation forest may suffer from the curse of dimensionality when processing high-dimensional data. For this reason, the consumption data and the emission data corresponding to each sampled data are fused, on the basis of its neighborhood data, to obtain the multidimensional feature of each sampled data.

In this embodiment of the invention, the consumption characteristic value corresponding to the i-th sampled data is obtained as follows:

Wherein the formula involves the following quantities: the consumption characteristic value corresponding to the i-th sampled data; the number of types of consumption data corresponding to the sampled data; the mean of the j-th type of consumption data over the i-th sampled data and all neighborhood data of the i-th sampled data; and the standard deviation of the j-th type of consumption data over the same data. The larger this standard deviation, the less reliable the corresponding mean, and the less that mean reflects the average level of the consumption data corresponding to the i-th sampled data. The consumption characteristic value corresponding to the i-th sampled data thus represents the average level of the consumption data corresponding to the i-th sampled data.

The emission characteristic value corresponding to the i-th sampled data is obtained as follows:


Wherein the formula involves the following quantities: the emission characteristic value corresponding to the i-th sampled data; the number of types of emission data corresponding to the sampled data, which is the same as the number of types of consumption data; the mean of the j-th type of emission data over the i-th sampled data and all neighborhood data of the i-th sampled data; and the standard deviation of the j-th type of emission data over the same data. The larger this standard deviation, the less reliable the corresponding mean, and the less that mean reflects the average level of the emission data corresponding to the i-th sampled data. The emission characteristic value corresponding to the i-th sampled data thus represents the average level of the emission data corresponding to the i-th sampled data.
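The ingredients of the two characteristic values, namely the per-type mean and standard deviation over a sampled data and its neighborhood data, can be computed as in the sketch below. The formula that combines them is not reproduced in this text, so `characteristic_value` uses a purely hypothetical aggregation (each mean divided by one plus its standard deviation, averaged over types) chosen only to show a mean being trusted less as its standard deviation grows. The use of the population standard deviation and all function names are likewise assumptions.

```python
import statistics


def type_statistics(records):
    """Per-type mean and population standard deviation.

    records: list of equal-length tuples, one per record (the sampled data
    plus its neighborhood data), with one entry per data type.
    """
    columns = list(zip(*records))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]
    return means, stds


def characteristic_value(records):
    """Hypothetical aggregation: average of mean / (1 + std) over all types."""
    means, stds = type_statistics(records)
    return sum(m / (1.0 + s) for m, s in zip(means, stds)) / len(means)
```

The same two functions apply unchanged to consumption records and to emission records, since both characteristic values are described in terms of the same statistics.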

According to the emission characteristic value and the consumption characteristic value corresponding to each sampled data, canonical correlation analysis is used to obtain the leading pairs of canonical variables of each sampled data (three pairs are used in the multidimensional feature below), and the correlation coefficient of each pair of canonical variables is obtained. Canonical correlation analysis is a well-known technique and is not described in detail in the embodiments of the present invention.

The multidimensional feature of the i-th sampled data is composed of the following components: the number of neighborhood data of the i-th sampled data; the consumption characteristic value corresponding to the i-th sampled data; the emission characteristic value corresponding to the i-th sampled data; the correlation coefficient of the first pair of canonical variables of the i-th sampled data; the correlation coefficient of the second pair of canonical variables of the i-th sampled data; and the correlation coefficient of the third pair of canonical variables of the i-th sampled data.

In this embodiment of the invention, the maximum tree depth, the number of training samples per tree and the number of trees are preset, and the number of sampled data is recorded. An isolation forest is constructed from the multidimensional feature of each sampled data, and the initial anomaly score of each sampled data is obtained from the constructed isolation forest. It should be noted that constructing an isolation forest and obtaining anomaly scores from it are known techniques and are not described further in the embodiments of the present invention.
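For readers unfamiliar with the technique, the following is a minimal pure-Python sketch of the standard isolation forest scoring scheme (random axis-aligned splits, with score equal to 2 raised to the negative mean path length divided by a normalizing constant). The preset values used in the embodiment are not legible in this text, so the defaults below (100 trees, at most 256 training samples per tree, depth limit derived from the sample size) are the conventional ones and are assumptions, as are all function names.

```python
import math
import random


def c_factor(n):
    """Average path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively isolate points with random axis-aligned splits."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                      # cannot split on a constant feature
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {
        "dim": dim,
        "split": split,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(node, point, depth=0):
    """Depth at which `point` lands, plus the adjustment for unsplit leaves."""
    if "size" in node:
        return depth + c_factor(node["size"])
    child = "left" if point[node["dim"]] < node["split"] else "right"
    return path_length(node[child], point, depth + 1)


def anomaly_scores(features, n_trees=100, sample_size=256, seed=0):
    """Initial anomaly score in (0, 1) for every feature tuple; higher is more anomalous."""
    psi = min(sample_size, len(features))
    if psi < 2:
        raise ValueError("at least two feature vectors are required")
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(features, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(t, f) for t in trees) / n_trees / norm)
            for f in features]
```

In this sketch each element of `features` would be the multidimensional feature of one sampled data, passed as a tuple; a point that is separated from the rest after very few splits receives a score close to 1.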

It should be noted that a high initial anomaly score for a sampled data indicates a high degree of outlierness, i.e. the sampled data is widely dispersed relative to its neighborhood data. For such a sampled data the correlation between emission and consumption is low, so its weight coefficient in the subsequent fitting process should be reduced appropriately. However, the initial anomaly score may be inaccurate, which would affect the subsequent regression analysis, so the initial anomaly score of each sampled data needs to be corrected. When every type of emission data and every type of consumption data fluctuates only slightly over the sampled data and its neighborhood data, the sampled data and its neighborhood data vary in a relatively stable manner, and the initial anomaly score of the sampled data should be adjusted downward.

In the embodiment of the invention, the anomaly score of each sampled data is acquired:

Wherein the formula involves the following quantities: the anomaly score of the i-th sampled data; the number of types of emission data corresponding to the sampled data; the standard deviation of the first-order differences of the j-th type of emission data over the i-th sampled data and its neighborhood data, where a smaller value indicates that the emission data fluctuate little, so that the initial anomaly score of the i-th sampled data needs to be reduced; the number of types of consumption data corresponding to the sampled data; the standard deviation of the first-order differences of the j-th type of consumption data over the i-th sampled data and its neighborhood data, where a smaller value likewise indicates that the consumption data fluctuate little, so that the initial anomaly score of the i-th sampled data needs to be reduced; the initial anomaly score of the i-th sampled data; and a normalization function, for which the linear normalization method is adopted and which is applied to the value of the expression it encloses.
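The quantities in this correction can be illustrated with a short sketch. The exact formula is not reproduced in this text, so `corrected_scores` below is a hypothetical stand-in: it sums the standard deviations of the first-order differences over all data types, applies linear (min-max) normalization across the sampled data, and multiplies the initial anomaly score by the result, so that a stable neighborhood lowers the score. The population standard deviation and all function names are assumptions.

```python
import statistics


def diff_std(series):
    """Population standard deviation of the first-order differences of a series."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    return statistics.pstdev(diffs) if diffs else 0.0


def fluctuation(records):
    """Sum of diff_std over every data type (column) of the records.

    records: list of tuples in acquisition order, covering the sampled data
    and its neighborhood data; emission and consumption types share one tuple.
    """
    return sum(diff_std(list(col)) for col in zip(*records))


def min_max(values):
    """Linear (min-max) normalization to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def corrected_scores(initial_scores, neighborhoods):
    """Hypothetical correction: initial score scaled by the normalized fluctuation."""
    factors = min_max([fluctuation(n) for n in neighborhoods])
    return [s * f for s, f in zip(initial_scores, factors)]
```

A neighborhood whose values change by a constant amount from record to record has zero standard deviation of first-order differences, so under this stand-in its initial score is reduced the most.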

At this point, the consumption characteristic value and the emission characteristic value corresponding to each sampled data have been obtained from its neighborhood data, the correlation coefficient of each pair of canonical variables of each sampled data has been obtained from these characteristic values, the multidimensional feature of each sampled data has been obtained, an isolation forest has been constructed from the multidimensional features to give the initial anomaly score of each sampled data, and the initial anomaly score has been corrected to give the anomaly score of each sampled data.

The regression model obtaining module 104 obtains an anomaly score matrix according to the anomaly score of each sampled data, and obtains a partial least squares (PLS) regression model according to the anomaly score matrix.

After obtaining the anomaly score of each sampled data, the weight coefficient of each sampled data needs to be corrected to obtain an accurate regression model.

In this embodiment of the invention, the final number of cycles of the objective function of the partial least squares method is determined by K-fold cross-validation.

Process of obtaining objective function of first cycle:

Obtaining the explanatory variable matrix of the first cycle: the size of the explanatory variable matrix is determined by the number of types of consumption data corresponding to the sampled data and by the number of sampled data, with one row per sampled data and one column per type of consumption data. In order of the acquisition time of the sampled data, the consumption data corresponding to each sampled data are taken as one row of the explanatory variable matrix, giving the explanatory variable matrix of the first cycle. Obtaining the explained variable matrix of the first cycle: the size of the explained variable matrix is determined by the number of types of emission data corresponding to the sampled data and by the number of sampled data. In order of the acquisition time of the sampled data, the emission data corresponding to each sampled data are taken as one row of the explained variable matrix, giving the explained variable matrix of the first cycle. The number of types of emission data corresponding to the sampled data is the same as the number of types of consumption data.

According to the anomaly score of each sampling data, acquiring an anomaly score matrix:

Wherein the entries are the anomaly score of the 1st sampled data, the anomaly score of the 2nd sampled data, and so on up to the anomaly score of the last sampled data, which together form the anomaly score matrix.

According to the explanatory variable matrix, the explained variable matrix and the anomaly score matrix of the first cycle, the objective function of the first cycle is obtained:

Wherein the formula involves the following quantities: the anomaly score matrix; the explanatory variable matrix of the first cycle; the explained variable matrix of the first cycle; the weight coefficient of the explained variable in the first cycle; the weight coefficient of the explanatory variable in the first cycle; and the output value of the objective function of the first cycle of the partial least squares method. It should be noted that the weight coefficient of the explanatory variable in the first cycle is the eigenvector corresponding to the largest eigenvalue of its associated matrix, and the weight coefficient of the explained variable in the first cycle is the eigenvector corresponding to the largest eigenvalue of its associated matrix. Obtaining these two weight coefficients is a known technique and is not described further in the embodiments of the present invention.

Process of obtaining objective function of second cycle:

According to the explanatory variable matrix, the explained variable matrix, the weight coefficient of the explained variable and the weight coefficient of the explanatory variable of the first cycle, the explanatory variable matrix and the explained variable matrix of the second cycle are obtained:

Wherein the formula involves the following quantities: the explanatory variable matrix of the first cycle; the explained variable matrix of the first cycle; the weight coefficient of the explanatory variable in the first cycle; the explanatory variable matrix of the second cycle; and the explained variable matrix of the second cycle. The two remaining quantities in the formula are each obtained by a separate expression.

Using the method for obtaining the objective function of the first cycle, the objective function of the second cycle is obtained from the explanatory variable matrix, the explained variable matrix and the anomaly score matrix of the second cycle; and so on, stopping once the objective function of the final cycle has been obtained.
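One cycle of this kind can be sketched as below. The patent's objective function and update formulas are not reproduced in this text, so the sketch follows a conventional sample-weighted PLS step and should be read as an assumption throughout: the anomaly score matrix is taken to be diagonal and supplied as a list of per-sample weights, the weight vector is the dominant eigenvector of the weighted cross-product matrix times its transpose, and the next cycle's matrices are obtained by removing the component explained by the score vector. The matrices are assumed to be column-centred beforehand, and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dominant_eigenvector(m, iterations=200):
    """Power iteration for the eigenvector of the largest eigenvalue of a symmetric PSD matrix."""
    v = [1.0 / math.sqrt(len(m))] * len(m)
    for _ in range(iterations):
        u = [sum(a * b for a, b in zip(row, v)) for row in m]
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            break
        v = [x / norm for x in u]
    return v


def deflate(matrix, t):
    """Remove from each column of `matrix` its projection onto the score vector t."""
    tt = sum(v * v for v in t)
    loads = [sum(c * tv for c, tv in zip(col, t)) / tt for col in zip(*matrix)]
    return [[x - tv * l for x, l in zip(row, loads)] for row, tv in zip(matrix, t)]


def pls_cycle(X, Y, weights):
    """One cycle: weight vector w, scores t, and the matrices for the next cycle."""
    # X^T A with A = diag(weights); the diagonal form of A is an assumption.
    XtA = [[x * a for x, a in zip(col, weights)] for col in zip(*X)]
    C = matmul(XtA, Y)                                  # weighted cross-product X^T A Y
    w = dominant_eigenvector(matmul(C, transpose(C)))
    t = [sum(x * wi for x, wi in zip(row, w)) for row in X]   # scores t = X w
    return w, t, deflate(X, t), deflate(Y, t)
```

Calling `pls_cycle` repeatedly on the returned matrices, for the number of cycles chosen by cross-validation, mirrors the loop described above. After each call the columns of the deflated explanatory matrix are orthogonal to the score vector, which is what lets the next cycle extract a new component.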

A partial least squares regression model is obtained from the objective function of the final cycle. It should be noted that obtaining the regression model from the objective function of the final cycle is an existing part of the partial least squares method and is not repeated in this embodiment. The regression model is then used to predict, in good time, high-pollution conditions that may arise in the subsequent production process.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements and the like made within the principles of the present invention shall be included in the scope of protection of the present invention.

Claims

1. An enterprise environmental remediation data processing system based on big data analysis, the system comprising:

Correcting the initial anomaly score of each sampled data according to the degree of fluctuation of the consumption data and of the emission data corresponding to the sampled data and the neighborhood data of the sampled data, to obtain the anomaly score of each sampled data, comprises the following steps:

wherein the formula involves the following quantities: the anomaly score of the i-th sampled data; the number of types of emission data corresponding to the sampled data; the standard deviation of the first-order differences of the j-th type of emission data over the i-th sampled data and its neighborhood data; the number of types of consumption data corresponding to the sampled data; the standard deviation of the first-order differences of the j-th type of consumption data over the i-th sampled data and its neighborhood data; the initial anomaly score of the i-th sampled data; and a normalization function;

the regression model obtaining module is used for obtaining an anomaly score matrix according to the anomaly score of each sampled data and correcting the objective function to obtain a regression model;

obtaining the anomaly score matrix according to the anomaly score of each sampled data and correcting the objective function to obtain the regression model comprises the following steps:

obtaining the explanatory variable matrix of the first cycle: the size of the explanatory variable matrix is determined by the number of types of consumption data corresponding to the sampled data and by the number of sampled data; in order of the acquisition time of the sampled data, the consumption data corresponding to each sampled data are taken as one row of the explanatory variable matrix, giving the explanatory variable matrix of the first cycle; obtaining the explained variable matrix of the first cycle: the size of the explained variable matrix is determined by the number of types of emission data corresponding to the sampled data and by the number of sampled data; in order of the acquisition time of the sampled data, the emission data corresponding to each sampled data are taken as one row of the explained variable matrix, giving the explained variable matrix of the first cycle;

wherein the entries are the anomaly score of the 1st sampled data, the anomaly score of the 2nd sampled data, and so on up to the anomaly score of the last sampled data, which together form the anomaly score matrix;

the eigenvector corresponding to the largest eigenvalue of the associated matrix is taken as the weight coefficient of the explanatory variable in the first cycle; the eigenvector corresponding to the largest eigenvalue of the associated matrix is taken as the weight coefficient of the explained variable in the first cycle;

wherein the formula involves the following quantities: the explanatory variable matrix of the first cycle; the explained variable matrix of the first cycle; the weight coefficient of the explanatory variable in the first cycle; the explanatory variable matrix of the second cycle; and the explained variable matrix of the second cycle; the two remaining quantities in the formula are each obtained by a separate expression;

2. The big data analysis based enterprise environmental remediation data processing system of claim 1, wherein the dividing the enterprise environmental data into segments comprises the steps of:

3. The system for processing environmental management data of an enterprise based on big data analysis according to claim 1, wherein the step of obtaining the discrete degree of the initial neighborhood data of each sampled data based on the consumption data and the discharge data corresponding to the sampled data and the initial neighborhood data of the sampled data comprises the steps of:

each type of consumption data and each type of emission data is taken as one dimension of a multidimensional space; the i-th sampled data is converted into a coordinate point in the multidimensional space, recorded as the multidimensional coordinate point of the i-th sampled data; the j-th initial neighborhood data of the i-th sampled data is converted into a coordinate point in the multidimensional space, recorded as the multidimensional coordinate point of the j-th initial neighborhood data of the i-th sampled data;

wherein the formula involves the following quantities: the degree of dispersion of the initial neighborhood data of the i-th sampled data; the number of initial neighborhood data of the i-th sampled data; and the Euclidean distance between the multidimensional coordinate point of the i-th sampled data and the multidimensional coordinate point of the j-th initial neighborhood data of the i-th sampled data.

4. The system for processing environmental remediation data of an enterprise based on big data analysis of claim 1, wherein the step of deleting or expanding the initial neighborhood data of each sampled data based on the degree of discretization of the initial neighborhood data of each sampled data to obtain the neighborhood data of each sampled data includes the steps of:

the mean of the degrees of dispersion of the initial neighborhood data of all sampled data is obtained, and a step length is preset; when the degree of dispersion of the initial neighborhood data of the i-th sampled data is greater than or equal to the mean, both ends of the segment containing the i-th sampled data are pruned by the step length for the first time to obtain the first pruned segment of the i-th sampled data; the enterprise environment data in the first pruned segment are recorded as the first pruned neighborhood data of the i-th sampled data, and their degree of dispersion is obtained; when the degree of dispersion of the first pruned neighborhood data is still greater than or equal to the mean, both ends of the first pruned segment are pruned by the step length a second time to obtain the second pruned segment of the i-th sampled data; the enterprise environment data in the second pruned segment are recorded as the second pruned neighborhood data of the i-th sampled data, and their degree of dispersion is obtained and compared with the mean; and so on, stopping when the degree of dispersion of the k-th pruned neighborhood data of the i-th sampled data is less than the mean, whereupon the k-th pruned neighborhood data are taken as the neighborhood data of the i-th sampled data;

when the degree of dispersion of the initial neighborhood data of the i-th sampled data is less than the mean, both ends of the segment containing the i-th sampled data are expanded by the step length to obtain the first expanded segment of the i-th sampled data; the enterprise environment data in the first expanded segment are recorded as the first expanded neighborhood data of the i-th sampled data, and their degree of dispersion is obtained with the same method as for the initial neighborhood data; when the degree of dispersion of the first expanded neighborhood data is still less than the mean, both ends of the first expanded segment are expanded by the step length again to obtain the second expanded segment of the i-th sampled data; the enterprise environment data in the second expanded segment are recorded as the second expanded neighborhood data of the i-th sampled data; and so on, until the degree of dispersion of the k-th expanded neighborhood data of the i-th sampled data is greater than or equal to the mean, whereupon the k-th expanded neighborhood data are taken as the neighborhood data of the i-th sampled data.

5. The system for processing environmental management data of an enterprise based on big data analysis according to claim 1, wherein the step of obtaining the consumption characteristic value corresponding to each sampled data according to the sampled data and the consumption data corresponding to the neighborhood data of the sampled data comprises the steps of:

wherein the formula involves the following quantities: the consumption characteristic value corresponding to the i-th sampled data; the number of types of consumption data corresponding to the sampled data; the mean of the j-th type of consumption data over the i-th sampled data and all neighborhood data of the i-th sampled data; and the standard deviation of the j-th type of consumption data over the same data.

6. The system for processing environmental management data of an enterprise based on big data analysis according to claim 1, wherein the step of obtaining the characteristic value of the discharge amount corresponding to each sampled data according to the sampled data and the discharge amount data corresponding to the neighborhood data of the sampled data comprises the steps of:

7. The system for processing environmental remediation data of an enterprise based on big data analysis of claim 1, wherein the step of obtaining the correlation coefficient of each pair of canonical variables for each sampled data based on the emission characteristic value and the consumption characteristic value corresponding to each sampled data includes the steps of:

8. The system for processing environmental remediation data of an enterprise based on big data analysis of claim 1, wherein the steps of constructing an isolation forest from the multidimensional feature of each sampled data and obtaining an initial anomaly score for each sampled data include: