CN115099370A - Evaluation data set construction method and system for flow type industrial production data flow

Info

Publication number: CN115099370A
Application number: CN202211014655.5A
Authority: CN
Inventors: 南玉泽; 王栋; 党海峰; 夏建涛
Original assignee: Beijing Quanying Technology Co ltd
Current assignee: Beijing Quanying Technology Co ltd
Priority date: 2022-08-23
Filing date: 2022-08-23
Publication date: 2022-09-23
Anticipated expiration: 2042-08-23
Also published as: CN115099370B

Abstract

The invention relates to a method and a system for constructing an evaluation data set for a process-type industrial production data stream. The method comprises the following steps: S1, selecting a time series list [L_1, L_2, ..., L_i, ..., L_N] and a specified sequence L_0; S2, using a distance similarity screening strategy to obtain the distance similarity between the dependent variable y in L_0 and the dependent variable y in each L_i, thereby obtaining a similarity list, and constructing a first trend data set D1 based on the similarity list and a preset construction mode; S3, selecting a sequence X of a specified length from the historical data of the production data stream, obtaining the period T_3 of the sequence X by autocorrelation coefficient processing, generating a timestamp list based on T_3, and constructing a second trend data set D2 based on the elements of the timestamp list and a preset construction mode; S4, using the time series list [L_1, L_2, ..., L_i, ..., L_N] and the trained model to be evaluated to obtain an error sequence, and constructing a third trend data set D3 based on the error sequence and a preset construction mode; and S5, sampling D1, D2 and D3 to obtain an evaluation data set.

Description

Evaluation data set construction method and system for flow type industrial production data flow

Technical Field

The invention relates to the technical field of process type industrial production, in particular to a method and a system for constructing an evaluation data set for a process type industrial production data stream.

Background

In a machine learning modeling project for process-type industrial production, the effectiveness of a model should be evaluated throughout model training, model updating and online operation. Whether such an evaluation is sound depends on both a correct evaluation method and an effective evaluation data set. In other words, to evaluate a model over its full life cycle, a reasonable method for constructing the evaluation data set is needed in addition to a correct evaluation method, so that the data used for model testing are sufficient and correct.

Existing evaluation data sets are mainly constructed by the hold-out method, cross-validation and the bootstrap method. In a process-type industrial production scenario, however, a data set built in this way lacks the trend and periodicity of the time series, so it does not reflect the real data distribution at the time the model is used. The validity and generalization capability of the model therefore cannot be correctly measured on it, and the invalid evaluation data set distorts the test of the model.

Disclosure of Invention

(I) Technical problem to be solved

In view of the above shortcomings of the prior art, the present invention provides a method and a system for constructing an evaluation data set for a process-type industrial production data stream. It addresses the technical problem that existing evaluation data sets lack the trend and periodicity of a time series, so that the validity and generalization capability of a model to be evaluated cannot be correctly reflected when the model is evaluated on them.

(II) technical scheme

In order to achieve the purpose, the invention adopts the main technical scheme that:

the embodiment of the invention provides a method for constructing an evaluation data set for a process type industrial production data stream, which comprises the following steps:

S1, selecting, on the time axis of the time series data of the production data stream, a time series list [L_1, L_2, ..., L_i, ..., L_N] and a specified sequence L_0, wherein the time length of L_0 is the same as the update period T_0 of the model to be evaluated, and L_0 comprises the element data of the production data stream generated within the time period [t - T_0, t], t being the current timestamp;

S2, using a distance similarity screening strategy to obtain the distance similarity between the dependent variable y in L_0 and the dependent variable y in each L_i, thereby obtaining a similarity list; and constructing a first trend data set D1 from the time series data of the production data stream based on the similarity list and a preset construction mode;

S3, selecting a sequence X of a specified length from the historical data of the production data stream, obtaining the period T_3 of the sequence X by autocorrelation coefficient processing, generating a timestamp list of preset length K2 based on T_3, and constructing a second trend data set D2 from the time series data of the production data stream according to the elements of the timestamp list and a preset construction mode;

S4, using the time series list [L_1, L_2, ..., L_i, ..., L_N] and the trained model to be evaluated to obtain an error sequence, and constructing a third trend data set D3 from the time series data of the production data stream based on the error sequence and a preset construction mode;

S5, sampling D1, D2 and D3 to obtain an evaluation data set for evaluating the model to be evaluated.

Preferably,

the specified sequence L_0 comprises: z_01, z_02, ..., z_0w, ..., z_0n;

z_0w is the w-th element data of the specified sequence L_0, arranged in chronological order;

the time series L_i comprises: m pieces of element data, arranged in chronological order, of the time series data of the production data stream within a preset first time interval T_1 before the current timestamp t;

wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im];

z_ij is the j-th element data of the time series L_i, arranged in chronological order;

wherein,

f is a preset value;

each element data includes: the time stamp corresponding to the element data, and the independent variable and the dependent variable y corresponding to the preset model to be evaluated.

Preferably, the S2 specifically includes:

S21, using the distance similarity screening strategy to obtain, based on the specified sequence L_0 and the time series list [L_1, L_2, ..., L_i, ..., L_N], the distance similarity between L_0 and each time series in the list;

S22, obtaining a similarity list based on the distance similarity between the specified sequence L_0 and each time series in the time series list [L_1, L_2, ..., L_i, ..., L_N];

the similarity list includes: the time series in the list [L_1, L_2, ..., L_i, ..., L_N] corresponding to the K1 largest distance similarities;

wherein K1 is a preset value and 0 ≤ K1 ≤ N/10;

S23, acquiring the first trend data set D1 from the historical production operation time series data based on the similarity list and a preset construction mode.

Preferably, the S21 specifically includes:

S211, for the specified sequence L_0 and a time series L_i, obtaining the corresponding distance matrix D_(L0,Li);

Wherein,

；

wherein,

；

the two operands of the element distance are, respectively, the dependent variable y in the w-th element data of the specified sequence L_0 and the dependent variable y in the j-th element data of the time series L_i, each arranged in chronological order;

S212, based on the distance matrix D_(L0,Li), recursively computing with formula (1) the minimum distance L_min(m, n) from element d_11 to element d_mn of the distance matrix, and taking L_min(m, n) as the distance similarity between the specified sequence L_0 and the time series L_i;

the formula (1) is:

；

wherein L_min(w, j) is the minimum distance from element d_11 to any element d_wj of the distance matrix;

wherein,

；

；

。
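The images of the distance matrix, of formula (1) and of its boundary conditions are not preserved in this text. As a hedged sketch only, the surrounding definitions are consistent with the standard dynamic time warping (DTW) recursion shown below. The symbols y_{0w} and y_{ij}, the absolute-difference element distance and the exact boundary conditions are assumptions for illustration, not the confirmed formulas of the patent.

```latex
% Assumed element distance between the w-th value of L_0 and the j-th value of L_i
d_{wj} = \lvert y_{0w} - y_{ij} \rvert

% Standard DTW recursion for the minimum accumulated distance
L_{\min}(w, j) = d_{wj} + \min\bigl\{ L_{\min}(w-1, j),\; L_{\min}(w, j-1),\; L_{\min}(w-1, j-1) \bigr\}

% Assumed boundary conditions
L_{\min}(1, 1) = d_{11}, \qquad
L_{\min}(w, 1) = d_{w1} + L_{\min}(w-1, 1), \qquad
L_{\min}(1, j) = d_{1j} + L_{\min}(1, j-1)
```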

preferably, the S23 specifically includes:

s231, acquiring a first timestamp set based on the similarity list;

the first set of timestamps includes: the time stamp corresponding to the last element data in each time sequence in the similarity list;

S232, taking each timestamp in the first timestamp set as a starting point, acquiring backward from it the element data of the production operation time series data within one period T_0, and merging the acquired element data to obtain the first trend data set D1.
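The window extraction in S231 and S232 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `(timestamp, y)` tuple layout and the function names are hypothetical, and because the translated text does not make clear whether "backward" means later or earlier on the time axis, the direction is exposed as a `forward` parameter.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical element layout: (timestamp, dependent variable y)
Element = Tuple[int, float]


def window(data: Sequence[Element], anchor: int, period: int,
           forward: bool = True) -> List[Element]:
    """Return the elements within one period of the anchor timestamp (inclusive)."""
    lo, hi = (anchor, anchor + period) if forward else (anchor - period, anchor)
    return [e for e in data if lo <= e[0] <= hi]


def build_trend_set(data: Sequence[Element], anchors: Sequence[int],
                    period: int, forward: bool = True) -> List[Element]:
    """Merge the windows of all anchor timestamps, keeping each element once."""
    merged: Dict[int, Element] = {}
    for anchor in anchors:
        for element in window(data, anchor, period, forward):
            merged[element[0]] = element  # overlapping windows collapse here
    return [merged[ts] for ts in sorted(merged)]
```

Keying the merge on the timestamp means that overlapping windows do not produce duplicate element data in D1.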

Preferably, S3 specifically includes:

S31, based on a sequence X of a specified length and p preset translation time segments, acquiring, for each preset translation time segment, the two sub data sets of the sequence X corresponding to that segment;

the sequence X of the specified length comprises: h pieces of element data, arranged in chronological order, of the historical production operation time series data within a second time interval T_2;

wherein the second time interval T_2 is greater than or equal to 15 days;

X = [z_1, z_2, ..., z_r, ..., z_h];

z_r is the r-th element data, in chronological order, of the historical production operation time series data within the second time interval T_2;

the p preset translation time segments are, in order: t_1, t_2, ..., t_g, ..., t_p;

wherein 0 ≤ p ≤ 30;

wherein t_g is the g-th of the p preset translation time segments;

the two sub data sets of the sequence X corresponding to any preset translation time segment are the first sub data set and the second sub data set of that translation time segment;

the first sub data set of any preset translation time segment comprises: the element data of the sequence X that fall within that translation time segment, the start time of the second time interval T_2 being taken as the start time of the translation time segment;

the second sub data set of any preset translation time segment comprises: the element data of the sequence X other than those within that translation time segment, the start time of the second time interval T_2 being taken as the start time of the translation time segment;

s32, respectively acquiring autocorrelation coefficients of two sub data sets corresponding to a sequence X with a specified length in any preset translation time segment by adopting a formula (2);

the formula (2) is:

wherein the left-hand side of formula (2) is the autocorrelation coefficient of the two sub data sets of the sequence X corresponding to the preset translation time segment t_g;

the mean term is the mean value of the dependent variable y over the element data of the sequence X;

X_r is the dependent variable y in the r-th element data of the sequence X;

the two remaining terms are the dependent variable y in the r-th element data of the first sub data set and of the second sub data set, respectively, of the sequence X corresponding to the preset translation time segment t_g;
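The image of formula (2) is not preserved in this text. For orientation only, the textbook sample autocorrelation coefficient at a lag of g samples is shown below. The symbols \rho_g and \mu, and the expression of the translation time segment as a lag counted in samples, are assumptions, so this may differ from the exact form used in the patent.

```latex
\rho_g = \frac{\sum_{r=1}^{h-g} (X_r - \mu)\,(X_{r+g} - \mu)}{\sum_{r=1}^{h} (X_r - \mu)^2},
\qquad \mu = \frac{1}{h} \sum_{r=1}^{h} X_r
```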

S33, determining the first time period T_3 of the sequence X based on the autocorrelation coefficients obtained for the preset translation time segments;

wherein the first time period T_3 is the time interval between two adjacent peaks of a first curve;

the first curve is formed by connecting the autocorrelation coefficients obtained for all the preset translation time segments, in the order of the corresponding translation time segments;

S34, starting from the current time t, acquiring one piece of element data every first time period T_3 and taking its timestamp, so as to obtain a timestamp list A_2 containing K2 timestamps;

Wherein,

K2 is greater than or equal to 0 and less than or equal to h/50, where h is the number of element data in the sequence X of the specified length;

S35, determining the second trend data set D2 based on the timestamp list A_2;

the S35 specifically includes:

taking each timestamp in the timestamp list A_2 as a starting point, acquiring backward from it the element data within one first time period T_3, and merging the acquired element data to obtain the second trend data set D2.
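The formula for the timestamp list A_2 is not preserved in this text. The sketch below shows one plausible reading of S34, assuming the K2 timestamps are spaced T_3 apart and step back from the current time t. The function names and this spacing rule are assumptions; only the bound K2 ≤ h/50 comes from the text.

```python
from typing import List


def max_k2(h: int) -> int:
    """Largest K2 allowed by the stated bound 0 <= K2 <= h/50."""
    return h // 50


def timestamp_list(t_now: int, period: int, k2: int) -> List[int]:
    """Assumed form of A_2: K2 timestamps, one every period, stepping back from t_now."""
    if k2 < 0:
        raise ValueError("K2 must be non-negative")
    return [t_now - k * period for k in range(1, k2 + 1)]
```

Each timestamp returned here would then serve as the starting point of one window of length T_3 in S35.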

Preferably, the S4 specifically includes:

S41, inputting each piece of element data of each time series in the time series list [L_1, L_2, ..., L_i, ..., L_N] selected on the time axis into the trained model to be evaluated for prediction, and obtaining a prediction result for each piece of element data in each time series;

wherein the trained model to be evaluated has been trained in advance on the element data of the specified sequence L_0;

S42, acquiring the total error of each time series based on the prediction result of each piece of element data in the time series and the actual dependent variable y;

S43, determining, based on the total error of each time series, the K3 time series with the smallest total errors;

wherein K3 is a preset value and 0 ≤ K3 ≤ N/10;

S44, acquiring the third trend data set D3 based on the K3 time series with the smallest total errors;

the S44 specifically includes:

taking the timestamp of the last element data of each of the K3 time series with the smallest total errors as a starting point, acquiring backward from it the element data of the production operation time series data within one period T_0, and merging the acquired element data to obtain the third trend data set D3.

Preferably, the S42 specifically includes:

acquiring a total error of each time sequence by adopting a formula (3) based on a prediction result of each element data in each time sequence and the value of an actual dependent variable y;

the formula (3) is:

；

wherein,

the predicted-value term is the model's predicted value of the dependent variable y of the element data;

y is the actual value of the dependent variable y in the element data;

m is the number of element data in the time series L_i;

e_i is the total error of the time series L_i.
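The image of formula (3) is not preserved in this text. One form consistent with the listed symbols (a predicted value, the actual y, the count m and the total error e_i) is the mean squared error below. The choice of squared error and the symbols \hat{y}_{ij} and y_{ij} are assumptions; the patent may equally use an absolute or unnormalized error.

```latex
e_i = \frac{1}{m} \sum_{j=1}^{m} \bigl( \hat{y}_{ij} - y_{ij} \bigr)^2
```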

Preferably, the S5 specifically includes:

S51, sampling the first trend data set D1, the second trend data set D2 and the third trend data set D3 at a first proportion w_1, a second proportion w_2 and a third proportion w_3, respectively, to obtain a first trend sample set D1, a second trend sample set D2 and a third trend sample set D3;

S52, merging the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3 and removing duplicates to obtain the final evaluation data set D.

On the other hand, the embodiment further provides an evaluation data set construction system for the process-type industrial production data stream, which includes:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform any of the above evaluation data set construction methods for a process-type industrial production data stream.

(III) advantageous effects

The invention has the following beneficial effects: in the method and system for constructing an evaluation data set for a process-type industrial production data stream, a first trend data set D1 is constructed with a distance similarity screening strategy, starting from the change pattern of the time series; the period T_3 of the sequence X is obtained by autocorrelation coefficient processing and used to construct a second trend data set D2; and a third trend data set D3 is constructed by means of the trained model to be evaluated. Data sets with different characteristics are thus all included in the evaluation data set of the model to be evaluated, so that the constructed evaluation data set better reflects the prediction accuracy and generalization capability of the model at a specific time of use.

Drawings

FIG. 1 is a flow chart of an evaluation data set construction method for a flow-type industrial production data flow according to the present invention;

FIG. 2 is a schematic diagram of error distribution of a model to be evaluated on time series data of an actual production data stream;

FIG. 3 is a schematic diagram of the error distribution of a model to be evaluated on an evaluation data set constructed by the evaluation data set construction method for a process-type industrial production data stream of this embodiment;

FIG. 4 is a schematic diagram of the error distribution of a model to be evaluated on an evaluation data set constructed by the conventional hold-out method.

Detailed Description

For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.

In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Example one

Referring to fig. 1, the embodiment provides a method for constructing an evaluation data set for a flow-type industrial production data stream, including:

S1, selecting, on the time axis of the time series data of the production data stream, a time series list [L_1, L_2, ..., L_i, ..., L_N] and a specified sequence L_0, wherein the time length of L_0 is the same as the update period T_0 of the model to be evaluated, and L_0 comprises the element data of the production data stream generated within the time period [t - T_0, t], t being the current timestamp.

In practical application of this embodiment, the specified sequence L_0 comprises: z_01, z_02, ..., z_0w, ..., z_0n.

z_0w is the w-th element data of the specified sequence L_0, arranged in chronological order.

The time series L_i comprises: m pieces of element data, arranged in chronological order, of the time series data of the production data stream within a preset first time interval T_1 before the current timestamp t.

Wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im].

z_ij is the j-th element data of the time series L_i, arranged in chronological order.

Wherein,

and F is a preset value.

S2, using a distance similarity screening strategy to obtain the distance similarity between the dependent variable y in L_0 and the dependent variable y in each L_i, thereby obtaining a similarity list; and constructing a first trend data set D1 from the time series data of the production data stream based on the similarity list and a preset construction mode.

The S2 specifically includes:

S21, using the distance similarity screening strategy to obtain, based on the specified sequence L_0 and the time series list [L_1, L_2, ..., L_i, ..., L_N], the distance similarity between L_0 and each time series in the list.

Specifically, the S21 includes:

S211, for the specified sequence L_0 and a time series L_i, obtaining the corresponding distance matrix D_(L0,Li).

Wherein,

。

wherein,

。

The two operands of the element distance are, respectively, the dependent variable y in the w-th element data of the specified sequence L_0 and the dependent variable y in the j-th element data of the time series L_i, each arranged in chronological order.

S212, based on the distance matrix D_(L0,Li), recursively computing with formula (1) the minimum distance L_min(m, n) from element d_11 to element d_mn of the distance matrix, and taking L_min(m, n) as the distance similarity between the specified sequence L_0 and the time series L_i.

The formula (1) is:

。

Wherein L_min(w, j) is the minimum distance from element d_11 to any element d_wj of the distance matrix.

Wherein,

，

，

。
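The recursion in S211 and S212 reads like dynamic time warping (DTW), which is consistent with the later remark that sequences of different lengths are compared. The sketch below is a minimal pure-Python DTW under that assumption. The absolute-difference element distance and the function names are illustrative choices, since the formula images are missing from this text.

```python
from typing import List, Sequence


def distance_matrix(y0: Sequence[float], yi: Sequence[float]) -> List[List[float]]:
    """Pairwise element distances; absolute difference is an assumed metric."""
    return [[abs(a - b) for b in yi] for a in y0]


def dtw_distance(y0: Sequence[float], yi: Sequence[float]) -> float:
    """Minimum accumulated distance from the first to the last matrix element."""
    if not y0 or not yi:
        raise ValueError("both sequences must be non-empty")
    d = distance_matrix(y0, yi)
    rows, cols = len(y0), len(yi)
    inf = float("inf")
    # acc[w][j] is the minimum accumulated distance from d[0][0] to d[w][j]
    acc = [[inf] * cols for _ in range(rows)]
    for w in range(rows):
        for j in range(cols):
            if w == 0 and j == 0:
                best_prev = 0.0
            else:
                best_prev = min(
                    acc[w - 1][j] if w > 0 else inf,
                    acc[w][j - 1] if j > 0 else inf,
                    acc[w - 1][j - 1] if w > 0 and j > 0 else inf,
                )
            acc[w][j] = d[w][j] + best_prev
    return acc[rows - 1][cols - 1]
```

With this metric a smaller value means a closer match, so whether the "largest distance similarities" in S22 correspond to the smallest or the largest returned values depends on how the patent scales similarity.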

S22, obtaining a similarity list based on the distance similarity between the specified sequence L_0 and each time series in the time series list [L_1, L_2, ..., L_i, ..., L_N].

The similarity list includes: the time series in the list [L_1, L_2, ..., L_i, ..., L_N] corresponding to the K1 largest distance similarities.

Wherein K1 is a preset value and 0 ≤ K1 ≤ N/10.
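Selecting the K1 entries of the similarity list in S22 can be sketched with the standard `heapq` module. The mapping from series index to similarity value and the function name are hypothetical. The `largest` flag is included because the text selects the "largest distance similarities", while a raw minimum-path distance would rank the closest matches lowest.

```python
import heapq
from typing import Dict, List


def select_similar(similarity: Dict[int, float], k1: int,
                   largest: bool = True) -> List[int]:
    """Return the indices of the K1 series ranked by their similarity value."""
    pick = heapq.nlargest if largest else heapq.nsmallest
    ranked = pick(k1, similarity.items(), key=lambda item: item[1])
    return [index for index, _ in ranked]
```

For a list of N series the text bounds K1 by N/10, so a caller would typically pass something like `k1 = N // 10`.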

S23, acquiring the first trend data set D1 from the historical production operation time series data based on the similarity list and a preset construction mode.

Specifically, the S23 includes:

S231, acquiring a first timestamp set based on the similarity list.

The first timestamp set includes: the timestamp of the last element data of each time series in the similarity list.

S232, taking each timestamp in the first timestamp set as a starting point, acquiring backward from it the element data of the production operation time series data within one period T_0, and merging the acquired element data to obtain the first trend data set D1.

In this embodiment, the similarity between two time series of different lengths is quantified, and combining the result with the periodic data set makes the evaluation data better match the characteristics of the time series data.

S3, selecting a sequence X of a specified length from the historical data of the production data stream, obtaining the period T_3 of the sequence X by autocorrelation coefficient processing, generating a timestamp list of preset length K2 based on T_3, and constructing a second trend data set D2 from the time series data of the production data stream according to the elements of the timestamp list and a preset construction mode.

In practical application of this embodiment, S3 specifically includes:

S31, based on a sequence X of a specified length and p preset translation time segments, acquiring, for each preset translation time segment, the two sub data sets of the sequence X corresponding to that segment.

The sequence X of the specified length comprises: h pieces of element data, arranged in chronological order, of the historical production operation time series data within a second time interval T_2.

Wherein the second time interval T_2 is greater than or equal to 15 days.

X = [z_1, z_2, ..., z_r, ..., z_h].

z_r is the r-th element data, in chronological order, of the historical production operation time series data within the second time interval T_2.

The p preset translation time segments are, in order: t_1, t_2, ..., t_g, ..., t_p.

Wherein 0 ≤ p ≤ 30.

Wherein t_g is the g-th of the p preset translation time segments.

The two sub data sets of the sequence X corresponding to any preset translation time segment are the first sub data set and the second sub data set of that translation time segment.

The first sub data set of any preset translation time segment comprises: the element data of the sequence X that fall within that translation time segment, the start time of the second time interval T_2 being taken as the start time of the translation time segment.

The second sub data set of any preset translation time segment comprises: the element data of the sequence X other than those within that translation time segment, the start time of the second time interval T_2 being taken as the start time of the translation time segment.

S32, obtaining the autocorrelation coefficients of two sub data sets corresponding to the sequence X with the specified length in any preset translation time segment by adopting a formula (2).

The formula (2) is:

。

Wherein the left-hand side of formula (2) is the autocorrelation coefficient of the two sub data sets of the sequence X corresponding to the preset translation time segment t_g.

The mean term is the mean value of the dependent variable y over the element data of the sequence X.

X_r is the dependent variable y in the r-th element data of the sequence X.

The two remaining terms are the dependent variable y in the r-th element data of the first sub data set and of the second sub data set, respectively, of the sequence X corresponding to the preset translation time segment t_g.

S33, determining the first time period T_3 of the sequence X based on the autocorrelation coefficients obtained for the preset translation time segments.

Wherein the first time period T_3 is the time interval between two adjacent peaks of a first curve.

The first curve is formed by connecting the autocorrelation coefficients obtained for all the preset translation time segments, in the order of the corresponding translation time segments.
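Steps S32 and S33 can be illustrated with the textbook sample autocorrelation, swapped in here because the image of formula (2) is missing. The sketch treats each translation time segment as a lag counted in samples and takes the spacing of the first two local maxima of the coefficient curve as the period. Both simplifications and the function names are assumptions.

```python
from typing import List, Optional, Sequence


def autocorr(y: Sequence[float], lag: int) -> float:
    """Sample autocorrelation coefficient of y at the given lag (in samples)."""
    h = len(y)
    if not 0 <= lag < h:
        raise ValueError("lag must satisfy 0 <= lag < len(y)")
    mean = sum(y) / h
    denom = sum((v - mean) ** 2 for v in y)
    if denom == 0:
        return 0.0  # constant series: no meaningful correlation
    num = sum((y[r] - mean) * (y[r + lag] - mean) for r in range(h - lag))
    return num / denom


def estimate_period(y: Sequence[float], max_lag: int) -> Optional[int]:
    """Spacing, in samples, between the first two peaks of the autocorrelation curve."""
    coeffs: List[float] = [autocorr(y, g) for g in range(1, max_lag + 1)]
    # coeffs[k] belongs to lag k + 1; a peak is a strict rise followed by no rise
    peaks = [k + 1 for k in range(1, len(coeffs) - 1)
             if coeffs[k - 1] < coeffs[k] >= coeffs[k + 1]]
    if len(peaks) < 2:
        return None
    return peaks[1] - peaks[0]
```

Multiplying the returned sample count by the sampling interval of the data stream gives the period as a time duration, which is how T_3 is used in the later steps.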

S34, starting from the current time t, acquiring one piece of element data every first time period T_3 and taking its timestamp, so as to obtain a timestamp list A_2 containing K2 timestamps.

Wherein,

K2 is greater than or equal to 0 and less than or equal to h/50, where h is the number of element data in the sequence X of the specified length.

S35, determining the second trend data set D2 based on the timestamp list A_2.

The S35 specifically includes:

taking each timestamp in the timestamp list A_2 as a starting point, acquiring backward from it the element data within one first time period T_3, and merging the acquired element data to obtain the second trend data set D2.

S4, using the time series list [L_1, L_2, ..., L_i, ..., L_N] and the trained model to be evaluated to obtain an error sequence, and constructing a third trend data set D3 from the time series data of the production data stream based on the error sequence and a preset construction mode.

The S4 specifically includes:

S41, inputting each piece of element data of each time series in the time series list [L_1, L_2, ..., L_i, ..., L_N] selected on the time axis into the trained model to be evaluated for prediction, and obtaining a prediction result for each piece of element data in each time series.

Wherein the trained model to be evaluated has been trained in advance on the element data of the specified sequence L_0.

S42, acquiring the total error of each time series based on the prediction result of each piece of element data in the time series and the actual dependent variable y.

The S42 specifically includes:

acquiring the total error of each time series with formula (3), based on the prediction result of each piece of element data in the time series and the value of the actual dependent variable y.

The formula (3) is:

。

wherein,

the predicted-value term is the model's predicted value of the dependent variable y of the element data.

y is the actual value of the dependent variable y in the element data.

m is the number of element data in the time series L_i.

e_i is the total error of the time series L_i.

S43, determining, based on the total error of each time series, the K3 time series with the smallest total errors.

Wherein K3 is a preset value and 0 ≤ K3 ≤ N/10.

S44, acquiring the third trend data set D3 based on the K3 time series with the smallest total errors.

The S44 specifically includes:

taking the timestamp of the last element data of each of the K3 time series with the smallest total errors as a starting point, acquiring backward from it the element data of the production operation time series data within one period T_0, and merging the acquired element data to obtain the third trend data set D3.

In this embodiment, historical periods in which the model to be evaluated ran in a state similar to its current one are selected by means of the model's own prediction errors, so the resulting data set better reflects the prediction capability of the model in a real usage scenario.
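Steps S41 to S44 can be sketched as a ranking of the candidate time series by prediction error. Mean squared error stands in for formula (3), whose image is missing, and the element layout, the `predict` callable and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical element layout: (timestamp, independent variables, actual y)
Element = Tuple[int, Sequence[float], float]
Predictor = Callable[[Sequence[float]], float]


def total_error(series: Sequence[Element], predict: Predictor) -> float:
    """Assumed form of the total error: mean squared prediction error."""
    return sum((predict(x) - y) ** 2 for _, x, y in series) / len(series)


def lowest_error_anchors(series_list: Sequence[Sequence[Element]],
                         predict: Predictor, k3: int) -> List[int]:
    """Last timestamps of the K3 series on which the model errs least."""
    ranked = sorted(series_list, key=lambda s: total_error(s, predict))
    return [series[-1][0] for series in ranked[:k3]]
```

The returned timestamps play the same role as the first timestamp set in S231: each one anchors a window of length T_0 whose element data are merged into D3.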

S5, sampling D1, D2 and D3 to obtain an evaluation data set for evaluating the model to be evaluated.

Specifically, the S5 includes:

S51, sampling the first trend data set D1, the second trend data set D2 and the third trend data set D3 at a first proportion w_1, a second proportion w_2 and a third proportion w_3, respectively, to obtain a first trend sample set D1, a second trend sample set D2 and a third trend sample set D3.

S52, merging the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3 and removing duplicates to obtain the final evaluation data set D.
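The sampling and merging in S51 and S52 can be sketched as below. The text does not say how samples are drawn or how duplicates are recognized, so uniform random sampling without replacement and de-duplication by timestamp are assumptions, as are the function names and the `(timestamp, y)` layout.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical element layout: (timestamp, dependent variable y)
Element = Tuple[int, float]


def sample_fraction(data: Sequence[Element], fraction: float,
                    rng: random.Random) -> List[Element]:
    """Draw round(len(data) * fraction) elements without replacement."""
    count = round(len(data) * fraction)
    return rng.sample(list(data), count)


def build_evaluation_set(d1: Sequence[Element], d2: Sequence[Element],
                         d3: Sequence[Element], w1: float, w2: float,
                         w3: float, seed: int = 0) -> List[Element]:
    """Sample D1, D2 and D3 at proportions w1, w2, w3, then merge and de-duplicate."""
    rng = random.Random(seed)
    merged: Dict[int, Element] = {}
    for data, weight in ((d1, w1), (d2, w2), (d3, w3)):
        for element in sample_fraction(data, weight, rng):
            merged[element[0]] = element  # the timestamp is the assumed duplicate key
    return [merged[ts] for ts in sorted(merged)]
```

Fixing the seed makes the evaluation set reproducible, which helps when the same model is re-evaluated after an update.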

In this embodiment, data sets with different characteristics (the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3) are all included in the evaluation data set, so that the evaluation covers both the prediction capability and the generalization capability of the model, and the evaluation of the model to be evaluated is more objective and comprehensive.

According to the evaluation data set construction method for a process-type industrial production data stream of this embodiment, a first trend data set D1 is constructed with a distance similarity screening strategy, starting from the change pattern of the time series; the period T_3 of the sequence X is obtained by autocorrelation coefficient processing and used to construct a second trend data set D2; and a third trend data set D3 is constructed by means of the trained model to be evaluated. Data sets with different characteristics are thus all included in the evaluation data set of the model to be evaluated, so that the constructed evaluation data set better reflects the prediction accuracy and generalization capability of the model at a specific time of use.

Finally, in the practical application of the embodiment, the final evaluation data set D is input into the model to be evaluated, so as to obtain a more accurate evaluation result.

Example two

In order to better understand the scheme of the embodiment of the present invention, the steps of the embodiment of the present invention are described in detail below.

The embodiment provides a method for constructing an evaluation data set for a flow-type industrial production data stream, which comprises the following steps:

101. Selecting, on the time axis of the time series data of the production data stream, a time series list [L_1, L_2, ..., L_i, ..., L_N] and a specified sequence L_0, wherein the time length of L_0 is the same as the update period T_0 of the model to be evaluated, and L_0 comprises the element data of the production data stream generated within the time period [t - T_0, t], t being the current timestamp; that is, the specified sequence L_0 is the real operating data closest, on the time axis, to the model to be evaluated.

z_0w is the w-th element data of the specified sequence L_0, arranged in chronological order.

Wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im].

Wherein,

and F is a preset value.

For example, in a specific application of this embodiment, Table 1 shows part of the data of the specified sequence L₀ corresponding to a model to be evaluated whose dependent variable is the main steam production (the independent variables are the amount of coal fed into the furnace, the total air volume entering the furnace, and the oxygen content of the exhaust gas):

TABLE 1

Timestamp	Amount of coal fed into furnace	Total air flow into furnace	Oxygen content of exhaust gas	Main steam production
1647050100000	15.721	76.5558	2.1969	171.1258
1647050106000	15.5656	75.834	2.4192	171.2354
1647050112000	15.4694	76.1088	2.4407	171.2512
1647050118000	16.0389	76.8488	2.362	172.0235
1647050124000	16.027	76.0773	2.4075	172.1235
1647050130000	16.7334	76.1407	2.1873	172.3258
1647050136000	16.993	76.2791	2.1918	173.3984
1647050142000	17.3402	76.0581	2.2747	173.9423
1647050148000	17.3856	77.4178	2.5426	173.4521
1647050154000	17.6544	77.3188	2.3211	174.2144
1647050160000	17.0855	77.5035	2.2665	174.6845
1647050166000	17.6559	77.9596	2.3259	174.9545
1647050172000	17.7959	79.6381	2.2277	175.3632
1647050178000	18.0086	79.8114	2.2929	175.8852
1647050184000	18.1417	79.6916	2.3011	175.8567
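As an illustration of step 101, the following minimal sketch selects the specified sequence L₀ as the rows whose timestamps fall in [t − T₀, t]. The rows are the first five rows of Table 1; the function name `select_l0`, the tuple layout, and the value T₀ = 18000 (in the same units as the timestamps) are assumptions made only for this example and are not taken from the embodiment.

```python
# Rows copied from Table 1:
# (timestamp, coal, total air flow, oxygen content, main steam production)
ROWS = [
    (1647050100000, 15.721, 76.5558, 2.1969, 171.1258),
    (1647050106000, 15.5656, 75.834, 2.4192, 171.2354),
    (1647050112000, 15.4694, 76.1088, 2.4407, 171.2512),
    (1647050118000, 16.0389, 76.8488, 2.362, 172.0235),
    (1647050124000, 16.027, 76.0773, 2.4075, 172.1235),
]


def select_l0(rows, t, t0):
    """Return the rows with t - t0 <= timestamp <= t, in time order.

    Hypothetical helper: treating both interval ends as inclusive
    is an assumption of this sketch.
    """
    return sorted(r for r in rows if t - t0 <= r[0] <= t)


# Assumed values: current timestamp t is the last row, T0 = 18000.
l0 = select_l0(ROWS, t=1647050124000, t0=18000)
```

With these assumed values, `l0` holds the four most recent rows, and the last column of each row is the dependent variable y used in the later steps.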

102. For the specified sequence L₀ and a time series L_i, obtain the distance matrix D_(L0, Li) corresponding to the specified sequence L₀ and the time series L_i.

Wherein,

。

wherein,

。

103. Based on the distance matrix D_(L0, Li), use recursion formula (1) to compute the minimum distance L_min(m, n) from element d₁₁ to element d_mn of the distance matrix D_(L0, Li), and take the minimum distance L_min(m, n) as the distance similarity between the specified sequence L₀ and the time series L_i.

The formula (1) is:

。

Wherein,

，

，

。
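Steps 102 and 103 can be sketched as follows. Because the element distance and recursion formula (1) are not reproduced in this text, the sketch assumes an absolute difference between dependent variables as the element distance and the standard dynamic time warping recurrence (each cell adds its own distance to the smallest of its left, upper, and upper-left neighbours). Both choices, and the function names, are assumptions rather than the formulas of the embodiment.

```python
def distance_matrix(y0, yi):
    """d[w][j] = |y0[w] - yi[j]| (assumed element distance)."""
    return [[abs(a - b) for b in yi] for a in y0]


def min_path_distance(d):
    """Minimum accumulated distance from d[0][0] to d[-1][-1].

    Assumed recurrence (standard dynamic time warping):
    acc[w][j] = d[w][j] + min(acc[w-1][j], acc[w][j-1], acc[w-1][j-1]).
    """
    rows, cols = len(d), len(d[0])
    inf = float("inf")
    acc = [[inf] * cols for _ in range(rows)]
    for w in range(rows):
        for j in range(cols):
            if w == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    acc[w - 1][j] if w > 0 else inf,
                    acc[w][j - 1] if j > 0 else inf,
                    acc[w - 1][j - 1] if w > 0 and j > 0 else inf,
                )
            acc[w][j] = d[w][j] + best
    return acc[-1][-1]
```

Under these assumptions, two sequences that differ only by a repeated value, such as [1, 2, 3] and [1, 2, 2, 3], have an accumulated distance of 0.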

104. Based on the distance similarity between the specified sequence L₀ and each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N], obtain a similarity list.

The similarity list includes: the time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] that respectively correspond to the K1 maximum distance similarities.

Wherein K1 is a preset value, and 0 ≤ K1 ≤ N/10.
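The selection in step 104 can be sketched as below, following the wording above literally: the K1 entries with the largest distance-similarity values are kept, subject to 0 ≤ K1 ≤ N/10. The function name and the (index, value) pair layout are hypothetical.

```python
import heapq


def build_similarity_list(similarities, k1):
    """Return the indices of the k1 largest distance similarities.

    similarities: list of (series_index, distance_similarity) pairs,
    one per time series L_i (hypothetical layout).
    """
    n = len(similarities)
    if not 0 <= k1 <= n / 10:
        raise ValueError("K1 must satisfy 0 <= K1 <= N/10")
    top = heapq.nlargest(k1, similarities, key=lambda pair: pair[1])
    return [index for index, _ in top]
```

For N = 20 time series the constraint allows at most K1 = 2, so a call with K1 = 3 raises `ValueError` in this sketch.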

105. And acquiring a first timestamp set based on the similarity list.

106. Taking each timestamp in the first timestamp set as a starting point, acquire backward the element data of the production operation time-series data within a period T₀, and take the union of the acquired element data to obtain the first trend data set D1.
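The backward window and union of step 106 can be sketched as follows. The sketch assumes that element data are keyed by timestamp and that a window covers [s − T₀, s] inclusive for each starting timestamp s; the function name and the dictionary layout are hypothetical.

```python
def backward_window_union(records, start_timestamps, period):
    """Union of the element data found in [s - period, s] for each start s.

    records: dict mapping timestamp -> element data (hypothetical layout).
    Keying the result by timestamp makes overlapping windows
    contribute each element only once.
    """
    selected = {}
    for s in start_timestamps:
        for ts, row in records.items():
            if s - period <= ts <= s:
                selected[ts] = row
    return dict(sorted(selected.items()))
```

For example, with starting timestamps 30 and 40 and a period of 20, the two windows [10, 30] and [20, 40] overlap, and the union contains the elements at 10, 20, 30 and 40 exactly once.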

107. Based on the sequence X with the specified length and p preset translation time segments, two subdata sets corresponding to the sequence X with the specified length in any preset translation time segment are obtained.

The sequence X of the specified length comprises: h element data, arranged in time order, of the historical production operation time-series data within a second time interval T₂.

X = [z₁, z₂, ..., z_r, ..., z_h].

Wherein 0 ≤ p ≤ 30.

The first sub data set of any preset translation time segment comprises: the element data of the sequence X of the specified length that fall within that preset translation time segment, with the start time of the second time interval T₂ taken as the start time of the segment.

The second sub data set of any preset translation time segment comprises: the element data of the sequence X of the specified length other than those within that preset translation time segment, with the start time of the second time interval T₂ likewise taken as the start time of the segment.

108. Using formula (2), obtain the autocorrelation coefficient of the two sub data sets corresponding to the sequence X of the specified length in each preset translation time segment.

The formula (2) is:

。

S33, based on the autocorrelation coefficients of the two sub data sets corresponding to the sequence X of the specified length in each preset translation time segment, determine the first time period T₃ corresponding to the sequence X of the specified length.
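The period estimation of steps 107 to S33 can be illustrated with the sketch below. Formula (2) is not reproduced in this text, so the sketch substitutes the standard sample autocorrelation at an integer lag and takes the lag with the largest coefficient as the period; the mapping from translation time segments to integer lags, the selection rule, and the function names are all assumptions.

```python
import math


def autocorrelation(y, lag):
    """Standard sample autocorrelation of y at a positive integer lag.

    Assumes y is not constant (otherwise the variance term is zero).
    """
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y)
    cov = sum((y[r] - mean) * (y[r + lag] - mean) for r in range(n - lag))
    return cov / var


def estimate_period(y, max_lag):
    """Return the lag in 1..max_lag with the largest autocorrelation."""
    return max(range(1, max_lag + 1), key=lambda g: autocorrelation(y, g))


# Synthetic dependent variable with a known period of 12 samples.
series = [math.sin(2 * math.pi * r / 12) for r in range(240)]
period = estimate_period(series, max_lag=30)
```

The upper bound of 30 lags mirrors the constraint p ≤ 30 on the number of translation time segments. On smooth real data the largest coefficient often occurs at a very short lag, so a practical selection rule would likely need more care than this sketch shows.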

109. Starting from the current time t, acquire element data at intervals of the first time period T₃ and record the timestamp corresponding to each acquired element data, so as to obtain a timestamp list A₂ with K2 timestamps.

Wherein,

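110. Based on the timestamp list A₂, determine a second trend data set D2.

Step 110 specifically includes:

Taking each timestamp in the timestamp list A₂ as a starting point, acquire backward the element data within a first time period T₃, and merge the acquired element data to obtain the second trend data set D2.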
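Steps 109 and 110 can be sketched as below. The sketch assumes that the timestamp list A₂ starts at the current time t and steps backward in multiples of T₃, and it takes the backward window of length T₃ per timestamp from the wording of claim 6. K2 is passed in directly because its defining expression is not reproduced in this text. The function names are hypothetical.

```python
def timestamp_list(t, t3, k2):
    """K2 timestamps starting at t and stepping backward by T3.

    Including t itself and stepping backward are assumptions.
    """
    return [t - k * t3 for k in range(k2)]


def d2_windows(a2, t3):
    """(window_start, window_end) for each timestamp, looking back T3."""
    return [(ts - t3, ts) for ts in a2]
```

For example, `timestamp_list(1000, 100, 3)` gives `[1000, 900, 800]`, and the first window returned by `d2_windows` for that list is `(900, 1000)`.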

111. Input each element data of each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] selected on the time axis into the trained model to be evaluated for prediction, and obtain a prediction result for each element data in each time series.

112. And acquiring the total error of each time series based on the prediction result of each element data in each time series and the actual dependent variable y.

Step 112 specifically includes:

The formula (3) is:

。

wherein,

is a predicted value of the dependent variable y of the element data.

y is the dependent variable y in the element data.

m is the number of element data in the time series L_i.

e_i is the total error of the time series L_i.

113. Based on the total error of each time series, determine the K3 time series with the smallest total errors.

114. Based on the K3 time series with the smallest total errors, acquire a third trend data set D3.

Step 114 specifically includes:

Taking the timestamp of the last element data in each of the K3 time series with the smallest total errors as a starting point, acquire backward the element data of the production operation time-series data within a time period T₀, and merge the acquired element data to obtain the third trend data set D3.
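Steps 112 and 113 can be sketched as follows. Formula (3) is not reproduced in this text, so the sketch assumes that the total error of a time series is the sum of absolute differences between the actual and predicted dependent variable over its m elements; this form and the function names are assumptions.

```python
def total_error(y_true, y_pred):
    """Sum of absolute errors over one time series.

    Assumed form of the total error e_i; formula (3) may differ.
    """
    return sum(abs(a - b) for a, b in zip(y_true, y_pred))


def k3_smallest(errors, k3):
    """Indices of the k3 time series with the smallest total error."""
    return sorted(range(len(errors)), key=lambda i: errors[i])[:k3]
```

The timestamps of the last element of each selected series would then serve as the starting points for the backward windows of length T₀ described in step 114.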

115. Sample D1, D2 and D3 to obtain an evaluation data set for evaluating the model to be evaluated.

Specifically, step 115 includes:

Sample the first trend data set D1, the second trend data set D2 and the third trend data set D3 at a first proportion w₁, a second proportion w₂ and a third proportion w₃ respectively, to obtain the corresponding first trend sample set D1, second trend sample set D2 and third trend sample set D3.

Merge the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3, and perform deduplication to obtain the final evaluation data set D.
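The sampling, merging and deduplication of step 115 can be sketched as below. The sketch assumes uniform random sampling without replacement and deduplication by timestamp; the embodiment does not state the sampling scheme, so these choices, the function name, and the `seed` parameter are assumptions.

```python
import random


def build_evaluation_set(d1, d2, d3, w1, w2, w3, seed=0):
    """Sample each trend data set at its proportion, merge, deduplicate.

    d1, d2, d3: lists of (timestamp, element_data) tuples (hypothetical).
    w1, w2, w3: sampling proportions in [0, 1].
    """
    rng = random.Random(seed)
    merged = {}
    for data, w in ((d1, w1), (d2, w2), (d3, w3)):
        k = round(len(data) * w)
        # Uniform sampling without replacement (assumed scheme).
        for ts, row in rng.sample(data, k):
            merged[ts] = row  # same timestamp is kept only once
    return sorted(merged.items())
```

Keying the merged result by timestamp means that an element appearing in more than one trend sample set enters the final evaluation data set D only once.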

According to the method for constructing the evaluation data set for the flow-type industrial production data stream, starting from the change rule of the time series, a first trend data set D1 is constructed using a distance similarity screening strategy; the period T₃ of the sequence X is obtained by autocorrelation coefficient processing, and a second trend data set D2 is constructed from the period T₃; a third trend data set D3 is constructed through the trained model to be evaluated; finally, the data sets with different characteristics are included in the evaluation data set of the model to be evaluated, so that the constructed evaluation data set better reflects the prediction accuracy and generalization capability of the model at a specific moment of use.

Finally, in the practical application of the embodiment, the final evaluation data set D is input into the model to be evaluated, so as to obtain a corresponding more accurate evaluation result.

As can be seen by comparing fig. 2, fig. 3 and fig. 4, the error distribution of the model to be evaluated on the evaluation data set constructed by the existing hold-out method is significantly larger than both its error distribution on the time-series data of the actual production data stream and its error distribution on the evaluation data set constructed by the method of this embodiment. The error distribution of the model to be evaluated on the evaluation data set constructed by the method of this embodiment lies within ±3, with a mean of 0. This indicates that the evaluation data set constructed by the method of this embodiment is closer to the real operating data, that is, it can reflect the prediction performance of the model to be evaluated in actual use.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions.

It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the terms first, second, third and the like is for convenience only and does not denote any order. These words are to be understood as part of the name of the component.

Furthermore, it should be noted that in the description of the present specification, the description of the term "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims

1. A method for constructing an evaluation data set oriented to a flow type industrial production data flow is characterized by comprising the following steps:

S1, for the time-series data of the production data stream, selecting on the time axis a time series list [L₁, L₂, ..., L_i, ..., L_N] and a specified sequence L₀, wherein the time length corresponding to L₀ is identical to the update period T₀ of the model to be evaluated, and the specified sequence L₀ comprises the element data of the production data stream in the time period [t − T₀, t], where t is the current timestamp;

S2, using a distance similarity screening strategy, obtaining the distance similarity between the dependent variable y in L₀ and the dependent variable y in L_i, and obtaining a similarity list; constructing a first trend data set D1 in the time-series data of the production data stream based on the similarity list and a preset construction mode;

S3, selecting a sequence X of a specified length from the historical data of the production data stream, obtaining the period T₃ of the sequence X by autocorrelation coefficient processing, generating a timestamp list of preset length K2 based on T₃, and constructing a second trend data set D2 in the time-series data of the production data stream according to the elements in the timestamp list and a preset construction mode;

2. The method of claim 1,

z_0w is the w-th element data of the specified sequence L₀, arranged in time order;

wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im];

z_ij is the j-th element data of the time series L_i, arranged in time order;

wherein,

F is a preset value;

3. The method according to claim 2, wherein the S2 specifically includes:

the similarity list includes: the time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] that respectively correspond to the K1 maximum distance similarities;

4. The method according to claim 3, wherein the S21 specifically includes:

S211, for the specified sequence L₀ and a time series L_i, obtaining the distance matrix D_(L0, Li) corresponding to the specified sequence L₀ and the time series L_i;

Wherein,

；

wherein,

；

is the dependent variable y in the w-th element data of the specified sequence L₀, arranged in time order;

is the dependent variable y in the j-th element data of the time series L_i, arranged in time order;

S212, based on the distance matrix D_(L0, Li), using recursion formula (1) to compute the minimum distance L_min(m, n) from element d₁₁ to element d_mn of the distance matrix D_(L0, Li), and taking the minimum distance L_min(m, n) as the distance similarity between the specified sequence L₀ and the time series L_i;

the formula (1) is:

；

wherein L_min(w, j) is the minimum distance from element d₁₁ of the distance matrix to any element d_wj of the distance matrix;

wherein,

；

；

。

5. The method according to claim 4, wherein the S23 specifically includes:

S231, acquiring a first timestamp set based on the similarity list;

6. The method according to claim 5, wherein S3 specifically comprises:

X = [z₁, z₂, ..., z_r, ..., z_h];

the p preset translation time segments are, in order: t₁, t₂, ..., t_g, ..., t_p;

wherein 0 ≤ p ≤ 30;

the two sub data sets corresponding to the sequence X with the specified length in any preset translation time segment are respectively a first sub data set and a second sub data set of the any preset translation time segment;

the first sub data set of any preset translation time segment comprises: the element data of the sequence X of the specified length that fall within that preset translation time segment, with the start time of the second time interval T₂ taken as the start time of the segment;

the formula (2) is:

is the dependent variable y in the r-th element data of the first sub data set corresponding to the sequence X of the specified length in the preset translation time segment t_g;

S33, based on the autocorrelation coefficients of the two sub data sets corresponding to the sequence X of the specified length in each preset translation time segment, determining the first time period T₃ corresponding to the sequence X of the specified length;

S34, starting from the current time t, acquiring element data at intervals of the first time period T₃ and recording the timestamp corresponding to each acquired element data, so as to obtain a timestamp list A₂ with K2 timestamps;

Wherein,

the S35 specifically includes:

taking each timestamp in the timestamp list A₂ as a starting point, respectively acquiring backward the element data within a first time period T₃, and merging the acquired element data to obtain the second trend data set D2.

7. The method according to claim 6, wherein the S4 specifically includes:

S41, inputting each element data of each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] selected on the time axis into the trained model to be evaluated for prediction, and obtaining a prediction result for each element data in each time series;

S44, acquiring a third trend data set D3 based on the K3 time series with the smallest total errors;

the S44 specifically includes:

8. The method according to claim 7, wherein the S42 specifically includes:

the formula (3) is:

；

wherein,

is a predicted value of the dependent variable y of the element data;

y is a dependent variable y in the element data;

m is the number of element data in the time series L_i;

e_i is the total error of the time series L_i.

9. The method according to claim 8, wherein the S5 specifically includes:

10. An evaluation data set construction system oriented to a flow-type industrial production data stream, characterized by comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the evaluation data set construction method oriented to a flow-type industrial production data stream according to any one of claims 1-9.