CN114662698A

CN114662698A - Industrial internet multi-modal machine learning data processing method

Info

Publication number: CN114662698A
Application number: CN202210129788.0A
Authority: CN
Inventors: 吴斌; 王雪峰; 刘青
Original assignee: Nanjing Inrich Technology Co ltd
Current assignee: Nanjing Inrich Technology Co ltd
Priority date: 2022-02-11
Filing date: 2022-02-11
Publication date: 2022-06-24

Abstract

The invention discloses an industrial internet multi-modal machine learning data processing method and relates to the technical field of industrial internet. The method comprises the following steps. Step one: calculate the correlation between every two multi-modal data sets; first clean the data so that the data are aligned in time, then judge whether the two data sets are correlated. If not, judge whether all data have been processed. If so, select a suitable data set of the two as modeling data, then judge whether all data have been processed. Step two: if not all data have been processed, restart step one; if all data have been processed, establish a suitable multi-modal machine learning model. The invention makes it easier to select different combinations of data sources for different scenarios, effectively reduces system cost, yields a smaller machine learning model, and facilitates deployment in edge computing.

Description

Industrial internet multi-modal machine learning data processing method

Technical Field

The invention relates to the technical field of industrial internet, and in particular to an industrial internet multi-modal machine learning data processing method.

Background

In the prior art, once a large number of terminals are introduced in an industrial internet scenario, the collected data can come from many different data sources. For example, to build a machine model that determines whether a transformer substation in a power grid is operating normally, one can collect temperature and humidity at different times, the content of specific gases separated from the transformer oil, visible light data (video and images), infrared thermogram data (captured by a thermal imaging sensor), sound, smell and the like. When several data sources exist, building a multi-modal machine learning model from them is an existing way of exploiting the related data sets. However, the prior art has paid little attention to how to measure the value of each data source in the model. This makes it difficult to select different data sources for different scenarios and leads to high system cost.

Disclosure of Invention

The technical problem to be solved by the invention is how to measure the value of each data source in the model, a question that has received little study in the prior art.

In order to solve the above technical problem, the invention adopts the following technical scheme: an industrial internet multi-modal machine learning data processing method comprising the following specific steps:

step one: calculating the correlation between every two multi-modal data sets, first cleaning the data so that the data are aligned in time, and then judging whether the two data sets are correlated:

if not: judging whether all data have been processed;

if so: selecting a suitable data set of the two as modeling data, and then judging whether all data have been processed;

step two: restarting step one if not all data have been processed, and establishing a suitable multi-modal machine learning model if all data have been processed.
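The control flow of the two steps can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `select_modeling_sets`, the two callables `correlation` and `choose_better`, and the rule of skipping pairs in which one data set has already been discarded are all hypothetical choices made for the example.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def select_modeling_sets(
    names: Sequence[str],
    correlation: Callable[[str, str], float],
    choose_better: Callable[[str, str], str],
    threshold: float,
) -> List[str]:
    """Process every pair of data sets and keep one set of each correlated pair."""
    kept = set(names)
    # Step one, repeated until every pair has been processed.
    for a, b in combinations(names, 2):
        if a not in kept or b not in kept:
            continue  # assumption: a discarded data set is not compared again
        if abs(correlation(a, b)) > threshold:
            winner = choose_better(a, b)
            kept.discard(b if winner == a else a)
    # Step two: all data processed; the remaining sets feed the multi-modal model.
    return [name for name in names if name in kept]


# Hypothetical correlation coefficients and test scores, for illustration only.
table = {
    ("temperature", "infrared"): 0.93,
    ("temperature", "sound"): 0.10,
    ("infrared", "sound"): -0.20,
}
scores = {"temperature": 0.81, "infrared": 0.88, "sound": 0.70}

selected = select_modeling_sets(
    ["temperature", "infrared", "sound"],
    correlation=lambda a, b: table[(a, b)],
    choose_better=lambda a, b: a if scores[a] >= scores[b] else b,
    threshold=0.8,
)
print(selected)  # ['infrared', 'sound']
```

With these made-up numbers, only the temperature and infrared sets exceed the threshold, so one of them is dropped and the sound data set is kept unchanged.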

Preferably, the data in step one are cleaned and aligned in time as follows: for all data within the same period, a fixed time interval is set; the data of every source at each time point form the cleaning output; and if a data source has no data at a given time point, the sample is calculated from the data before and after that point.

Preferably, the sample is obtained as follows: let the horizontal axis be the time axis, let $x$ be the sampling time point to be calculated, and let the preceding and following data be $(x_0, y_0)$ and $(x_1, y_1)$; the $y$ value at the sampling point is then calculated as:

$$y = y_0 + \frac{(x - x_0)(y_1 - y_0)}{x_1 - x_0}$$
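A minimal sketch of the cleaning step follows, assuming that the missing sample is filled by linear interpolation between the preceding and following data points. The function names `interpolate` and `align`, the `(time, value)` tuple layout, and the choice to raise an error when a grid point has no neighbor on one side are assumptions made for the example.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple


def interpolate(x: float, x0: float, y0: float, x1: float, y1: float) -> float:
    """Linear interpolation between (x0, y0) and (x1, y1) at time x."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


def align(samples: Sequence[Tuple[float, float]],
          start: float, stop: float, step: float) -> List[float]:
    """Resample (time, value) pairs, sorted by time, onto a fixed time grid."""
    times = [t for t, _ in samples]
    count = int(round((stop - start) / step)) + 1
    aligned = []
    for k in range(count):
        x = start + k * step
        i = bisect_left(times, x)
        if i < len(times) and times[i] == x:
            aligned.append(samples[i][1])      # the source has data at this point
        elif 0 < i < len(times):
            (x0, y0), (x1, y1) = samples[i - 1], samples[i]
            aligned.append(interpolate(x, x0, y0, x1, y1))
        else:
            raise ValueError(f"no preceding or following sample at time {x}")
    return aligned


# Irregularly sampled readings resampled onto a grid with interval 2.
readings = [(0, 10.0), (2, 14.0), (3, 15.0), (6, 21.0)]
print(align(readings, start=0, stop=6, step=2))  # [10.0, 14.0, 17.0, 21.0]
```

Here the grid point at time 4 has no reading, so its value is computed from the neighbors at times 3 and 6. Running every data source through the same grid gives series of equal length that can be compared pairwise.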

Preferably, the correlation is calculated as follows. Correlation can be expressed in two ways, as a covariance or as a correlation coefficient, and the correlation coefficient can be regarded as a normalized covariance. Let $X_t$ be the first cleaned data set, $Y_t$ the second cleaned data set, $\mu_x$ the mean of $X_t$, $\mu_y$ the mean of $Y_t$, $\sigma_x$ the standard deviation of $X_t$, $\sigma_y$ the standard deviation of $Y_t$, and $E[\cdot]$ the expectation. The covariance of $X_t$ and $Y_t$ is $\mathrm{Cov}(X_t, Y_t) = E[(X_t - \mu_x)(Y_t - \mu_y)^T]$, and the correlation coefficient of $X_t$ and $Y_t$ is $\mathrm{Cor}(X_t, Y_t) = \dfrac{\mathrm{Cov}(X_t, Y_t)}{\sigma_x \sigma_y}$.
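For two aligned scalar series, the covariance and the correlation coefficient can be computed as in the sketch below. It assumes that the expectation is estimated by the plain sample average (population form, dividing by the number of samples); the function names are illustrative.

```python
from math import sqrt
from typing import Sequence


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Estimate E[(X - mu_x)(Y - mu_y)] by averaging over the samples."""
    if len(x) != len(y) or not x:
        raise ValueError("series must be non-empty and of equal length")
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    return sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Covariance normalized by the two standard deviations."""
    sigma_x = sqrt(covariance(x, x))
    sigma_y = sqrt(covariance(y, y))
    if sigma_x == 0 or sigma_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance(x, y) / (sigma_x * sigma_y)


temperature = [20.0, 22.0, 25.0, 30.0]
infrared = [41.0, 44.5, 50.0, 61.0]
print(round(correlation(temperature, infrared), 3))
```

Because the coefficient is normalized, it does not depend on the units of either series, which is what makes a single threshold usable across data sources of different kinds.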

Preferably, whether two data sets are correlated is determined with a threshold: the correlation coefficient takes values between -1 and 1, and whenever its absolute value exceeds the threshold, the two data sets are regarded as correlated and only one of them is selected to participate in training the multi-modal model.
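The threshold test uses the absolute value, so a strongly negative coefficient marks a pair as correlated just as a strongly positive one does. A small sketch, in which the threshold 0.8 and the coefficient values are hypothetical:

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def correlated_pairs(coefficients: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Return the pairs whose correlation coefficient exceeds the threshold in magnitude."""
    return [pair for pair, r in coefficients.items() if abs(r) > threshold]


# Hypothetical coefficients for three pairs of data sources.
coefficients = {
    ("temperature", "infrared"): 0.93,
    ("humidity", "gas"): -0.87,
    ("sound", "gas"): 0.35,
}
print(correlated_pairs(coefficients, threshold=0.8))
# [('temperature', 'infrared'), ('humidity', 'gas')]
```

For each pair returned, only one of the two data sets would go on to train the multi-modal model.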

Preferably, a suitable data set is selected as modeling data in step one as follows: test data are used to evaluate, for each of the two data sets, the contribution to the detection result of a machine model trained with that data set, and the data set that performs better is selected. The machine model may be trained on each of the two data sets alone, or on each of the two data sets together with the same additional data. The machine learning model includes, but is not limited to, a decision tree, a random forest, linear regression, naive Bayes, a neural network (including a deep learning neural network), logistic regression and a support vector machine.
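The selection between two correlated data sets can be illustrated as follows. The sketch trains one model per candidate data set and compares accuracy on held-out test data. The model used here is a one-level decision tree (a decision stump) on a single scalar feature, chosen only to keep the example self-contained; the function names, the accuracy metric and all data values are assumptions made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Model = Callable[[float], int]


def fit_stump(xs: Sequence[float], labels: Sequence[int]) -> Model:
    """Pick the cut point and direction with the best training accuracy."""
    best = None
    for cut in sorted(set(xs)):
        for sign in (1, -1):
            hits = sum(int(sign * (x - cut) >= 0) == l for x, l in zip(xs, labels))
            if best is None or hits > best[0]:
                best = (hits, cut, sign)
    _, best_cut, best_sign = best
    return lambda x: int(best_sign * (x - best_cut) >= 0)


def accuracy(model: Model, xs: Sequence[float], labels: Sequence[int]) -> float:
    return sum(model(x) == l for x, l in zip(xs, labels)) / len(labels)


def choose_better(
    candidates: Dict[str, Tuple[List[float], List[float]]],
    train_labels: Sequence[int],
    test_labels: Sequence[int],
) -> Tuple[str, Dict[str, float]]:
    """Train one model per candidate data set and return the best test performer."""
    scores = {}
    for name, (train_x, test_x) in candidates.items():
        model = fit_stump(train_x, train_labels)
        scores[name] = accuracy(model, test_x, test_labels)
    return max(scores, key=scores.get), scores


# Made-up training and test values for two correlated sources (1 = fault, 0 = normal).
candidates = {
    "infrared": ([30, 32, 35, 60, 62, 70], [31, 65, 33, 61]),
    "humidity": ([40, 55, 42, 50, 45, 58], [44, 47, 52, 49]),
}
best, scores = choose_better(candidates, [0, 0, 0, 1, 1, 1], [0, 1, 0, 1])
print(best, scores)  # infrared {'infrared': 1.0, 'humidity': 0.75}
```

Any of the model families listed above could be substituted for the stump; the comparison on test data stays the same.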

The invention has the following beneficial effects:

By establishing a suitable multi-modal machine learning model, the invention makes it possible to select different combinations of data sources for different scenarios, effectively reduces system cost, and at the same time yields a smaller machine learning model, which facilitates deployment in edge computing.

Drawings

Fig. 1 is a flowchart of an industrial internet multimodal machine learning data processing method according to the present invention.

Detailed Description

The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawing, so that the advantages and features of the invention can be more readily understood by those skilled in the art and the scope of protection of the invention is more clearly defined.

Referring to Fig. 1, the industrial internet multi-modal machine learning data processing method comprises the following specific steps:

step one: calculating the correlation between every two multi-modal data sets. The data are first cleaned and aligned in time as follows: for all data within the same period, a fixed time interval is set; the data of every source at each time point form the cleaning output; and if a data source has no data at a given time point, the sample is calculated from the data before and after that point. The sample is obtained as follows: let the horizontal axis be the time axis, let $x$ be the sampling time point to be calculated, and let the preceding and following data be $(x_0, y_0)$ and $(x_1, y_1)$; the $y$ value at the sampling point is then calculated as:

$$y = y_0 + \frac{(x - x_0)(y_1 - y_0)}{x_1 - x_0}$$

With the data thus aligned in time, it is judged whether the two data sets are correlated:

if not: judging whether all data have been processed;

if so: selecting a suitable data set of the two as modeling data, and then judging whether all data have been processed;

step two: restarting step one if not all data have been processed, and establishing a suitable multi-modal machine learning model if all data have been processed.

The correlation is calculated as follows. Correlation can be expressed in two ways, as a covariance or as a correlation coefficient, and the correlation coefficient can be regarded as a normalized covariance. Let $X_t$ be the first cleaned data set, $Y_t$ the second cleaned data set, $\mu_x$ the mean of $X_t$, $\mu_y$ the mean of $Y_t$, $\sigma_x$ the standard deviation of $X_t$, $\sigma_y$ the standard deviation of $Y_t$, and $E[\cdot]$ the expectation. The covariance of $X_t$ and $Y_t$ is $\mathrm{Cov}(X_t, Y_t) = E[(X_t - \mu_x)(Y_t - \mu_y)^T]$, and the correlation coefficient of $X_t$ and $Y_t$ is $\mathrm{Cor}(X_t, Y_t) = \dfrac{\mathrm{Cov}(X_t, Y_t)}{\sigma_x \sigma_y}$.

A suitable data set is selected as modeling data in step one as follows: test data are used to evaluate, for each of the two data sets, the contribution to the detection result of a machine model trained with that data set, and the data set that performs better is selected. The machine model may be trained on each of the two data sets alone, or on each of the two data sets together with the same additional data. The machine learning model includes, but is not limited to, a decision tree, a random forest, linear regression, naive Bayes, a neural network (including a deep learning neural network), logistic regression and a support vector machine.

Whether two data sets are correlated is determined with a threshold: the correlation coefficient takes values between -1 and 1, and whenever its absolute value exceeds the threshold, the two data sets are regarded as correlated and only one of them is selected to participate in training the multi-modal model.

The above description is only an embodiment of the present invention and is not intended to limit the scope of the invention. All equivalent structures or equivalent process modifications made using the contents of the present specification and drawing, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present invention.

Claims

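1. An industrial internet multi-modal machine learning data processing method, characterized by comprising the following specific steps:

step one: calculating the correlation between every two multi-modal data sets, first cleaning the data so that the data are aligned in time, and then judging whether the two data sets are correlated:

if not: judging whether all data have been processed;

if so: selecting a suitable data set of the two as modeling data, and then judging whether all data have been processed;

step two: restarting step one if not all data have been processed, and establishing a suitable multi-modal machine learning model if all data have been processed.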

2. The industrial internet multi-modal machine learning data processing method according to claim 1, wherein in step one the data are cleaned and aligned in time as follows: for all data within the same period, a fixed time interval is set; the data of every source at each time point form the cleaning output; and if a data source has no data at a given time point, the sample is calculated from the data before and after that point.

3. The industrial internet multi-modal machine learning data processing method according to claim 2, wherein the sample is obtained as follows: let the horizontal axis be the time axis, let $x$ be the sampling time point to be calculated, and let the preceding and following data be $(x_0, y_0)$ and $(x_1, y_1)$; the $y$ value at the sampling point is then calculated as:

$$y = y_0 + \frac{(x - x_0)(y_1 - y_0)}{x_1 - x_0}$$

4. The industrial internet multi-modal machine learning data processing method according to claim 1, wherein the correlation is calculated as follows: correlation can be expressed in two ways, as a covariance or as a correlation coefficient, and the correlation coefficient can be regarded as a normalized covariance; let $X_t$ be the first cleaned data set, $Y_t$ the second cleaned data set, $\mu_x$ the mean of $X_t$, $\mu_y$ the mean of $Y_t$, $\sigma_x$ the standard deviation of $X_t$, $\sigma_y$ the standard deviation of $Y_t$, and $E[\cdot]$ the expectation; the covariance of $X_t$ and $Y_t$ is $\mathrm{Cov}(X_t, Y_t) = E[(X_t - \mu_x)(Y_t - \mu_y)^T]$, and the correlation coefficient of $X_t$ and $Y_t$ is $\mathrm{Cor}(X_t, Y_t) = \dfrac{\mathrm{Cov}(X_t, Y_t)}{\sigma_x \sigma_y}$.

5. The industrial internet multi-modal machine learning data processing method according to claim 1, wherein whether two data sets are correlated is determined with a threshold: the correlation coefficient takes values between -1 and 1, and whenever its absolute value exceeds the threshold, the two data sets are regarded as correlated and only one of them is selected to participate in training the multi-modal model.

6. The industrial internet multi-modal machine learning data processing method according to claim 1, wherein a suitable data set is selected as modeling data in step one as follows: test data are used to evaluate, for each of the two data sets, the contribution to the detection result of a machine model trained with that data set, and the data set that performs better is selected; the machine model may be trained on each of the two data sets alone, or on each of the two data sets together with the same additional data; and the machine learning model includes, but is not limited to, a decision tree, a random forest, linear regression, naive Bayes, a neural network (including a deep learning neural network), logistic regression and a support vector machine.

7. The industrial internet multi-modal machine learning data processing method according to claim 1, wherein the suitable multi-modal machine learning model is established as follows:

when data sets are selected pairwise, the data set with the larger amount of data is selected; once all selected data have been obtained, a model is retrained using the cleaned data; if models were already established during selection, the models corresponding to the selected data are put to further use, for example by connecting all the independent models corresponding to the selected data in parallel, or by directly outputting all the models trained on the selected data for use.
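One way to read the parallel connection of independent models is sketched below. The text does not state how the parallel outputs are combined, so majority voting is an assumption made here, as are the function name `predict_parallel`, the per-source models and their cut-off values.

```python
from collections import Counter
from typing import Callable, Hashable, Mapping

Model = Callable[[float], Hashable]


def predict_parallel(models: Mapping[str, Model], inputs: Mapping[str, float]) -> Hashable:
    """Run each per-source model on its own input and return the majority label."""
    votes = [model(inputs[name]) for name, model in models.items()]
    label, _ = Counter(votes).most_common(1)[0]
    return label


# Hypothetical models already trained during selection, one per retained source.
models = {
    "infrared": lambda v: int(v >= 60),
    "sound": lambda v: int(v >= 0.5),
    "gas": lambda v: int(v >= 100),
}
reading = {"infrared": 72, "sound": 0.2, "gas": 130}
print(predict_parallel(models, reading))  # 1
```

Reusing the per-source models in this way avoids retraining, at the cost of ignoring interactions between sources that a jointly retrained model could learn.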