CN115099370B

CN115099370B - Evaluation data set construction method and system for flow-type industrial production data stream

Info

Publication number: CN115099370B
Application number: CN202211014655.5A
Authority: CN
Inventors: 南玉泽; 王栋; 党海峰; 夏建涛
Original assignee: Beijing Quanying Technology Co ltd
Current assignee: Beijing Quanying Technology Co ltd
Priority date: 2022-08-23
Filing date: 2022-08-23
Publication date: 2022-12-02
Anticipated expiration: 2042-08-23
Also published as: CN115099370A

Abstract

The invention relates to a method and a system for constructing an evaluation data set for a flow-type industrial production data stream. The method comprises the following steps: S1, selecting a time series list [L₁, L₂, ..., L_i, ..., L_N] and a specified sequence L₀; S2, using a distance similarity screening strategy to obtain the distance similarity between the dependent variable y in L₀ and the dependent variable y in each L_i, thereby obtaining a similarity list, and constructing a first trend data set D1 based on the similarity list and a preset construction mode; S3, selecting a sequence X of a specified length from historical data of the production data stream, obtaining the period T₃ of the sequence X by autocorrelation coefficient processing, generating a timestamp list based on T₃, and constructing a second trend data set D2 according to the elements of the timestamp list and a preset construction mode; S4, using the time series list [L₁, L₂, ..., L_i, ..., L_N] and the trained model to be evaluated to obtain an error sequence, and constructing a third trend data set D3 based on the error sequence and a preset construction mode; and S5, sampling D1, D2 and D3 to obtain the evaluation data set.

Description

Evaluation data set construction method and system for flow-type industrial production data stream

Technical Field

The invention relates to the technical field of process type industrial production, in particular to a method and a system for constructing an evaluation data set for a process type industrial production data stream.

Background

In a machine learning modeling project for a flow-type industrial production scenario, evaluation of the model's effectiveness should run through model training, model updating and online operation. Whether that evaluation is reasonable depends on both a correct evaluation method and a valid evaluation data set. In other words, to evaluate a model over its full life cycle, a sound method for constructing the evaluation data set is needed in addition to a correct evaluation method, so that the data used for model testing are sufficient and correct.

Existing evaluation data sets are mainly constructed by the hold-out method, cross-validation and the bootstrap method. In a process-type industrial production scenario, however, an evaluation data set built in this way lacks the trend and periodicity of a time series, so it does not reflect the real data distribution that the model faces at the time of use. The validity and generalization capability of the model therefore cannot be correctly measured on such a data set, and the test results are distorted because the evaluation data set is invalid.

Disclosure of Invention

(I) Technical problem to be solved

In view of the above shortcomings of the prior art, the present invention provides a method and a system for constructing an evaluation data set for a flow-type industrial production data stream. It solves the technical problem that existing evaluation data sets lack the trend and periodicity of a time series, so that the validity and generalization capability of the model to be evaluated cannot be correctly reflected when the model is evaluated on them.

(II) technical scheme

In order to achieve the purpose, the invention adopts the main technical scheme that:

the embodiment of the invention provides a method for constructing an evaluation data set for a flow-type industrial production data stream, which comprises the following steps:

S1, for the time series data of the production data stream, selecting on the time axis a time series list [L₁, L₂, ..., L_i, ..., L_N] and a specified sequence L₀, wherein the time length of L₀ is the same as the update period T₀ of the model to be evaluated, and the specified sequence L₀ comprises the element data of the production data stream within the time period [t − T₀, t], t being the current timestamp;

S2, using a distance similarity screening strategy to obtain the distance similarity between the dependent variable y in L₀ and the dependent variable y in each L_i, thereby obtaining a similarity list; constructing a first trend data set D1 from the time series data of the production data stream based on the similarity list and a preset construction mode;

S3, selecting a sequence X of a specified length from the historical data of the production data stream, obtaining the period T₃ of the sequence X by autocorrelation coefficient processing, generating a timestamp list of preset length K2 based on T₃, and constructing a second trend data set D2 from the time series data of the production data stream according to the elements of the timestamp list and a preset construction mode;

S4, using the time series list [L₁, L₂, ..., L_i, ..., L_N] and the trained model to be evaluated to obtain an error sequence, and constructing a third trend data set D3 from the time series data of the production data stream based on the error sequence and a preset construction mode;

and S5, sampling the D1, the D2 and the D3 to obtain an evaluation data set for evaluating the model to be evaluated.

Preferably,

the specified sequence L₀ comprises: z_01, z_02, ..., z_0w, ..., z_0n;

z_0w is the w-th element data, in chronological order, of the specified sequence L₀;

the time series L_i comprises: m element data, arranged in chronological order, within a preset first time interval T₁ in the time series data of the production data stream before the current timestamp t;

wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im];

z_ij is the j-th element data, in chronological order, of the time series L_i;

wherein,

F is a preset value;

each element data includes: a timestamp corresponding to the element data, and an independent variable and a dependent variable y corresponding to a preset model to be evaluated.

Preferably, the S2 specifically includes:

S21, using a distance similarity screening strategy and based on the specified sequence L₀ and the time series list [L₁, L₂, ..., L_i, ..., L_N], obtaining the distance similarity between the specified sequence L₀ and each time series in the list;

S22, obtaining a similarity list based on the distance similarity between the specified sequence L₀ and each time series in the list [L₁, L₂, ..., L_i, ..., L_N];

the similarity list comprises: the time series in the list [L₁, L₂, ..., L_i, ..., L_N] that correspond to the K1 largest distance similarities;

wherein K1 is a preset value and 0 ≤ K1 ≤ N/10;

and S23, obtaining the first trend data set D1 from the historical production operation time series data based on the similarity list and a preset construction mode.

Preferably, the S21 specifically includes:

S211, for the specified sequence L₀ and a time series L_i, obtaining the corresponding distance matrix D_(L₀,L_i):

[formula not reproduced]

wherein

[formula not reproduced]

of the symbols used, one denotes the dependent variable y in the w-th element data, in chronological order, of the specified sequence L₀, and the other denotes the dependent variable y in the j-th element data, in chronological order, of the time series L_i;

S212, based on the distance matrix D_(L₀,L_i), using recursion formula (1) to recursively compute the minimum distance L_min(m, n) from element d₁₁ to element d_mn of the distance matrix, and taking the minimum distance L_min(m, n) as the distance similarity between the specified sequence L₀ and the time series L_i;

formula (1) is:

[formula not reproduced]

wherein L_min(w, j) is the minimum distance from element d₁₁ of the distance matrix to any element d_wj of the distance matrix;

wherein

[three further expressions not reproduced]

preferably, the S23 specifically includes:

s231, acquiring a first timestamp set based on the similarity list;

the first set of timestamps includes: a timestamp corresponding to the last element data in each time sequence in the similarity list;

and S232, taking each timestamp in the first timestamp set as a starting point, obtaining the element data of the production operation time series data within the period T₀ backwards from it, and taking the union to obtain the first trend data set D1.

Preferably, S3 specifically includes:

S31, based on the sequence X of the specified length and p preset translation time segments, obtaining the two sub data sets of the sequence X that correspond to each preset translation time segment;

the sequence X of the specified length comprises: h element data, arranged in chronological order, within a second time interval T₂ of the historical production operation time series data;

wherein the second time interval T₂ is at least 15 days;

X = [z₁, z₂, ..., z_r, ..., z_h];

z_r is the r-th element data, in chronological order, within the second time interval T₂ of the historical production operation time series data;

the p preset translation time segments are, in order: t₁, t₂, ..., t_g, ..., t_p;

wherein 0 ≤ p ≤ 30;

wherein t_g is the g-th of the p preset translation time segments;

the two sub data sets of the sequence X corresponding to any preset translation time segment are the first sub data set and the second sub data set of that translation time segment;

the first sub data set of a preset translation time segment comprises: the element data of the sequence X that fall within the translation time segment, the start time of the second time interval T₂ being taken as the start time of the translation time segment;

the second sub data set of a preset translation time segment comprises: the element data of the sequence X other than those that fall within the translation time segment, the start time of the second time interval T₂ again being taken as the start time of the translation time segment;

S32, using formula (2) to obtain the autocorrelation coefficient of the two sub data sets of the sequence X corresponding to each preset translation time segment;

formula (2) is:

[formula not reproduced]

wherein the quantities in formula (2) are, respectively:

the autocorrelation coefficient of the two sub data sets of the sequence X corresponding to the preset translation time segment t_g;

the mean of the dependent variable y over the element data of the sequence X;

X_r, the dependent variable y in the r-th element data of the sequence X;

the dependent variable y in the r-th element data of the first sub data set of the sequence X corresponding to the preset translation time segment t_g;

the dependent variable y in the r-th element data of the second sub data set of the sequence X corresponding to the preset translation time segment t_g;

S33, determining the first time period T₃ of the sequence X based on the autocorrelation coefficients of the two sub data sets of the sequence X for each preset translation time segment;

wherein the first time period T₃ is the time interval between two adjacent peaks of a first curve;

the first curve is obtained by connecting the autocorrelation coefficients obtained for all preset translation time segments in the order of the corresponding translation time segments;

S34, starting from the current time t, acquiring element data at intervals of the first time period T₃ and taking the timestamps of those element data, to obtain a timestamp list A₂ containing K2 timestamps;

wherein

[formula not reproduced]; 0 ≤ K2 ≤ h/50, and h is the number of element data in the sequence X;

S35, determining the second trend data set D2 based on the timestamp list A₂;

S35 specifically comprises:

taking each timestamp in the timestamp list A₂ as a starting point, obtaining the element data within the first time period T₃ backwards from it, and taking the union to obtain the second trend data set D2.

Preferably, the S4 specifically includes:

S41, inputting each element data of each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] selected on the time axis into the trained model to be evaluated for prediction, to obtain a prediction result for each element data of each time series;

wherein the trained model to be evaluated has been trained in advance on the element data of the specified sequence L₀;

S42, obtaining the total error of each time series based on the prediction result of each element data in the time series and the actual dependent variable y;

S43, determining the K3 time series with the smallest total errors based on the total error of each time series;

wherein K3 is a preset value and 0 ≤ K3 ≤ N/10;

S44, obtaining the third trend data set D3 based on the K3 time series with the smallest total errors;

S44 specifically comprises:

taking the timestamp of the last element data of each of the K3 time series with the smallest total errors as a starting point, obtaining the element data of the production operation time series data within the period T₀ backwards from it, and taking the union to obtain the third trend data set D3.

Preferably, the S42 specifically includes:

obtaining the total error of each time series using formula (3), based on the prediction result of each element data in the time series and the actual value of the dependent variable y;

formula (3) is:

[formula not reproduced]

wherein the quantities in formula (3) are, respectively:

the predicted value of the dependent variable y of an element data;

y, the dependent variable y in the element data;

m, the number of element data in the time series L_i;

e_i, the total error of the time series L_i.

Preferably, the S5 specifically includes:

S51, sampling the first trend data set D1, the second trend data set D2 and the third trend data set D3 according to a first proportion w₁, a second proportion w₂ and a third proportion w₃ respectively, to obtain a corresponding first trend sample set D1, second trend sample set D2 and third trend sample set D3;

and S52, taking a union set of the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3 and performing deduplication processing to obtain a final evaluation data set D.

On the other hand, the embodiment further provides an evaluation data set constructing system for the process-type industrial production data stream, which includes:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform any of the above-described evaluation data set construction methods for a flow-type industrial production data stream.

(III) advantageous effects

The beneficial effects of the invention are as follows. In the method and system for constructing an evaluation data set for a flow-type industrial production data stream, a first trend data set D1 is constructed by a distance similarity screening strategy, starting from the change pattern of the time series; the period T₃ of the sequence X is obtained by autocorrelation coefficient processing, and a second trend data set D2 is constructed from the period T₃; a third trend data set D3 is constructed by means of the trained model to be evaluated. Finally, these data sets with different characteristics are all included in the evaluation data set of the model to be evaluated, so that the constructed evaluation data set better reflects the prediction accuracy and generalization capability of the model at a specific time of use.

Drawings

FIG. 1 is a flow chart of an evaluation data set construction method for a flow-type industrial production data flow according to the present invention;

FIG. 2 is a schematic diagram of error distribution of a model to be evaluated on time series data of an actual production data stream;

FIG. 3 is a schematic diagram of error distribution of a model to be evaluated on an evaluation data set constructed by the evaluation data set construction method for a flow-type industrial production data flow in this embodiment;

FIG. 4 is a schematic diagram of error distribution of a model to be evaluated on an evaluation data set constructed by the conventional hold-out method.

Detailed Description

For the purpose of better explaining the present invention and to facilitate understanding, the present invention will be described in detail by way of specific embodiments with reference to the accompanying drawings.

In order to better understand the above technical solutions, exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Example one

Referring to fig. 1, the embodiment provides a method for constructing an evaluation data set for a flow-type industrial production data stream, including:

S1, for the time series data of the production data stream, selecting on the time axis a time series list [L₁, L₂, ..., L_i, ..., L_N] and a specified sequence L₀, wherein the time length of L₀ is the same as the update period T₀ of the model to be evaluated, and the specified sequence L₀ comprises the element data of the production data stream within the time period [t − T₀, t], t being the current timestamp.

In practical application of this embodiment, the specified sequence L₀ comprises: z_01, z_02, ..., z_0w, ..., z_0n.

z_0w is the w-th element data, in chronological order, of the specified sequence L₀.

The time series L_i comprises: m element data, arranged in chronological order, within a preset first time interval T₁ in the time series data of the production data stream before the current timestamp t.

Wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im].

z_ij is the j-th element data, in chronological order, of the time series L_i.

F is a preset value.

Each element data includes: the time stamp corresponding to the element data, and the independent variable and the dependent variable y corresponding to the preset model to be evaluated.
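The data layout of step S1 can be sketched as follows. This is only an illustration under stated assumptions: the `Element` class, the integer timestamps and the helper names are hypothetical, and the back-to-back arrangement of the candidate windows L₁ to L_N is an assumption, since the text does not fix how those windows are placed on the time axis.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Element:
    """One element data: timestamp, independent variables, dependent variable y."""
    timestamp: int            # assumed integer, e.g. epoch milliseconds
    x: Tuple[float, ...]      # independent variables of the model to be evaluated
    y: float                  # dependent variable


def specified_sequence(stream: Sequence[Element], t: int, T0: int) -> List[Element]:
    """L0: the elements generated in [t - T0, t], in chronological order."""
    picked = [e for e in stream if t - T0 <= e.timestamp <= t]
    return sorted(picked, key=lambda e: e.timestamp)


def candidate_list(stream: Sequence[Element], t: int, T0: int,
                   T1: int, N: int) -> List[List[Element]]:
    """Hypothetical layout of [L1..LN]: N back-to-back windows of length T1
    that end where L0 begins. The source does not specify this layout."""
    ordered = sorted(stream, key=lambda e: e.timestamp)
    windows, end = [], t - T0
    for _ in range(N):
        start = end - T1
        windows.append([e for e in ordered if start <= e.timestamp < end])
        end = start
    return windows
```

With ten elements stamped 0 to 9, `specified_sequence(stream, 9, 3)` keeps the elements stamped 6 to 9, and `candidate_list(stream, 9, 3, 2, 2)` returns the two windows that precede them, newest first.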

S2, using a distance similarity screening strategy to obtain the distance similarity between the dependent variable y in L₀ and the dependent variable y in each L_i, thereby obtaining a similarity list; and constructing a first trend data set D1 from the time series data of the production data stream based on the similarity list and a preset construction mode.

S2 specifically comprises the following steps:

S21, using a distance similarity screening strategy and based on the specified sequence L₀ and the time series list [L₁, L₂, ..., L_i, ..., L_N], obtaining the distance similarity between the specified sequence L₀ and each time series in the list.

S21 specifically comprises:

S211, for the specified sequence L₀ and a time series L_i, obtaining the corresponding distance matrix D_(L₀,L_i):

[formula not reproduced]

wherein

[formula not reproduced]

Of the symbols used, one denotes the dependent variable y in the w-th element data, in chronological order, of the specified sequence L₀, and the other denotes the dependent variable y in the j-th element data, in chronological order, of the time series L_i.

S212, based on the distance matrix D_(L₀,L_i), using recursion formula (1) to recursively compute the minimum distance L_min(m, n) from element d₁₁ to element d_mn of the distance matrix, and taking the minimum distance L_min(m, n) as the distance similarity between the specified sequence L₀ and the time series L_i.

Formula (1) is:

[formula not reproduced]

Wherein L_min(w, j) is the minimum distance from element d₁₁ of the distance matrix to any element d_wj of the distance matrix.

Wherein

[three further expressions not reproduced]
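The recursion in S212, a minimum cumulative distance from the first to the last element of a pairwise distance matrix, has the shape of a dynamic time warping (DTW) computation. The sketch below is a standard DTW, not a transcription of formula (1), which is not available in this text. It assumes an absolute difference as the point-wise distance and the usual three-neighbour recurrence; both are assumptions, as are the function names.

```python
from typing import List, Sequence


def distance_matrix(y0: Sequence[float], yi: Sequence[float]) -> List[List[float]]:
    """d[w][j] = |y0[w] - yi[j]| (assumed point-wise distance)."""
    return [[abs(a - b) for b in yi] for a in y0]


def dtw_distance(y0: Sequence[float], yi: Sequence[float]) -> float:
    """Minimum cumulative distance from d[0][0] to d[-1][-1] (standard DTW)."""
    if not y0 or not yi:
        raise ValueError("both sequences must be non-empty")
    d = distance_matrix(y0, yi)
    n, m = len(y0), len(yi)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    for w in range(n):
        for j in range(m):
            if w == 0 and j == 0:
                best = 0.0
            else:
                # Cheapest way to arrive: from above, from the left, or diagonally.
                best = min(
                    acc[w - 1][j] if w > 0 else inf,
                    acc[w][j - 1] if j > 0 else inf,
                    acc[w - 1][j - 1] if w > 0 and j > 0 else inf,
                )
            acc[w][j] = d[w][j] + best
    return acc[n - 1][m - 1]
```

Because the warping path may repeat elements, the two sequences do not need the same length: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0.0 under these assumptions.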

S22, obtaining a similarity list based on the distance similarity between the specified sequence L₀ and each time series in the list [L₁, L₂, ..., L_i, ..., L_N].

The similarity list comprises: the time series in the list [L₁, L₂, ..., L_i, ..., L_N] that correspond to the K1 largest distance similarities.

Wherein K1 is a preset value and 0 ≤ K1 ≤ N/10.

S23, obtaining the first trend data set D1 from the historical production operation time series data based on the similarity list and a preset construction mode.

S23 specifically comprises:

S231, obtaining a first timestamp set based on the similarity list.

The first timestamp set comprises: the timestamp of the last element data of each time series in the similarity list.

S232, taking each timestamp in the first timestamp set as a starting point, obtaining the element data of the production operation time series data within the period T₀ backwards from it, and taking the union to obtain the first trend data set D1.
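Steps S22 to S232 can be sketched as a ranking followed by a union of time windows. The code below is a hedged illustration: the `(timestamp, y)` row layout and both function names are hypothetical. Two points of the text are left open and therefore exposed as parameters: `largest` follows the wording "K1 largest distance similarities" by default, and `after` chooses whether the T₀ window lies after or before each timestamp, since "backwards" can be read either way.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[int, float]  # (timestamp, dependent variable y); hypothetical layout


def select_end_timestamps(scores: Sequence[float], end_timestamps: Sequence[int],
                          k1: int, largest: bool = True) -> List[int]:
    """Timestamps of the last element of the k1 best-ranked time series."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return [end_timestamps[i] for i in order[:k1]]


def collect_windows(stream: Sequence[Row], anchors: Sequence[int],
                    length: int, after: bool = True) -> List[Row]:
    """Union of the rows inside a window of the given length at each anchor."""
    kept: Dict[int, Row] = {}
    for a in anchors:
        lo, hi = (a, a + length) if after else (a - length, a)
        for row in stream:
            if lo <= row[0] <= hi:
                kept[row[0]] = row   # keyed by timestamp, so overlaps merge
    return [kept[ts] for ts in sorted(kept)]
```

For example, `collect_windows(stream, select_end_timestamps(scores, ends, k1), T0)` would yield a candidate D1 under these assumptions.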

In this embodiment, the similarity of two time series of different lengths is quantified, and combining the result with the periodic data set better matches the characteristics of time series data.

S3, selecting a sequence X of a specified length from the historical data of the production data stream, obtaining the period T₃ of the sequence X by autocorrelation coefficient processing, generating a timestamp list of preset length K2 based on T₃, and constructing a second trend data set D2 from the time series data of the production data stream according to the elements of the timestamp list and a preset construction mode.

In practical application of this embodiment, S3 specifically comprises:

S31, based on the sequence X of the specified length and p preset translation time segments, obtaining the two sub data sets of the sequence X that correspond to each preset translation time segment.

The sequence X of the specified length comprises: h element data, arranged in chronological order, within a second time interval T₂ of the historical production operation time series data.

Wherein the second time interval T₂ is at least 15 days.

X = [z₁, z₂, ..., z_r, ..., z_h].

z_r is the r-th element data, in chronological order, within the second time interval T₂ of the historical production operation time series data.

The p preset translation time segments are, in order: t₁, t₂, ..., t_g, ..., t_p.

Wherein 0 ≤ p ≤ 30.

Wherein t_g is the g-th of the p preset translation time segments.

The two sub data sets of the sequence X corresponding to any preset translation time segment are the first sub data set and the second sub data set of that translation time segment.

The first sub data set of a preset translation time segment comprises: the element data of the sequence X that fall within the translation time segment, the start time of the second time interval T₂ being taken as the start time of the translation time segment.

The second sub data set of a preset translation time segment comprises: the element data of the sequence X other than those that fall within the translation time segment, the start time of the second time interval T₂ again being taken as the start time of the translation time segment.

S32, using formula (2) to obtain the autocorrelation coefficient of the two sub data sets of the sequence X corresponding to each preset translation time segment.

Formula (2) is:

[formula not reproduced]

Wherein the quantities in formula (2) are, respectively:

the autocorrelation coefficient of the two sub data sets of the sequence X corresponding to the preset translation time segment t_g;

the mean of the dependent variable y over the element data of the sequence X;

X_r, the dependent variable y in the r-th element data of the sequence X;

the dependent variable y in the r-th element data of the first sub data set of the sequence X corresponding to the preset translation time segment t_g;

the dependent variable y in the r-th element data of the second sub data set of the sequence X corresponding to the preset translation time segment t_g.
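As an illustration of the autocorrelation coefficient used in S32, the sketch below computes the standard lag-k sample autocorrelation of the dependent variable y. It is an assumption that this matches formula (2), which is not available in this text; in particular, the sketch pairs each value with the value `lag` positions later, which may differ from the patent's two sub data sets. The function name is hypothetical.

```python
from typing import Sequence


def autocorrelation(y: Sequence[float], lag: int) -> float:
    """Standard sample autocorrelation of y at the given lag (in samples)."""
    h = len(y)
    if not 0 <= lag < h:
        raise ValueError("lag must satisfy 0 <= lag < len(y)")
    mean = sum(y) / h
    denom = sum((v - mean) ** 2 for v in y)
    if denom == 0:
        return 0.0  # constant series: correlation is undefined, report 0 here
    num = sum((y[r] - mean) * (y[r + lag] - mean) for r in range(h - lag))
    return num / denom
```

For a series that repeats every four samples, the coefficient at lag 4 is strongly positive and the coefficient at lag 2 is strongly negative, which is the behaviour the peak search in S33 relies on.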

S33, determining the first time period T₃ of the sequence X based on the autocorrelation coefficients of the two sub data sets of the sequence X for each preset translation time segment.

Wherein the first time period T₃ is the time interval between two adjacent peaks of a first curve.

The first curve is obtained by connecting the autocorrelation coefficients obtained for all preset translation time segments in the order of the corresponding translation time segments.
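Reading the period off the first curve can be sketched as a local-maximum search. The peak criterion used here (strictly greater than the left neighbour, not smaller than the right one) and the choice of the first two peaks are assumptions, since the text only says "two adjacent peaks"; the function names are hypothetical.

```python
from typing import List, Optional, Sequence


def local_peaks(curve: Sequence[float]) -> List[int]:
    """Indices of interior local maxima of the autocorrelation curve."""
    return [g for g in range(1, len(curve) - 1)
            if curve[g - 1] < curve[g] >= curve[g + 1]]


def period_from_curve(curve: Sequence[float],
                      shifts: Sequence[float]) -> Optional[float]:
    """Time between the first two adjacent peaks, or None if there are fewer
    than two peaks. shifts[g] is the translation time of curve[g]."""
    peaks = local_peaks(curve)
    if len(peaks) < 2:
        return None
    return shifts[peaks[1]] - shifts[peaks[0]]
```

Here `curve[g]` would hold the coefficient obtained for translation time segment t_g and `shifts[g]` the corresponding translation time, so the returned value is in the same time unit as the shifts.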

S34, starting from the current time t, acquiring element data at intervals of the first time period T₃ and taking the timestamps of those element data, to obtain a timestamp list A₂ containing K2 timestamps.

Wherein [formula not reproduced]; 0 ≤ K2 ≤ h/50, and h is the number of element data in the sequence X.

S35, determining the second trend data set D2 based on the timestamp list A₂.

S35 specifically comprises:

taking each timestamp in the timestamp list A₂ as a starting point, obtaining the element data within the first time period T₃ backwards from it, and taking the union to obtain the second trend data set D2.
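A minimal sketch of S34 and S35 follows. The formula for the timestamp list A₂ is not available in this text, so the sketch assumes the list is t − T₃, t − 2·T₃, ..., t − K2·T₃, stepping back into the history from the current time t. The `(timestamp, y)` row layout, the function names and the `after` switch (window after or before each timestamp) are likewise assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[int, float]  # (timestamp, dependent variable y); hypothetical layout


def timestamp_list(t: int, T3: int, k2: int) -> List[int]:
    """Assumed form of A2: k2 timestamps spaced T3 apart, going back from t."""
    return [t - i * T3 for i in range(1, k2 + 1)]


def second_trend_set(stream: Sequence[Row], t: int, T3: int, k2: int,
                     after: bool = True) -> List[Row]:
    """Union of the rows inside a T3-long window at every timestamp of A2."""
    kept: Dict[int, Row] = {}
    for a in timestamp_list(t, T3, k2):
        lo, hi = (a, a + T3) if after else (a - T3, a)
        for row in stream:
            if lo <= row[0] <= hi:
                kept[row[0]] = row
    return [kept[ts] for ts in sorted(kept)]
```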

S4, using the time series list [L₁, L₂, ..., L_i, ..., L_N] and the trained model to be evaluated to obtain an error sequence, and constructing a third trend data set D3 from the time series data of the production data stream based on the error sequence and a preset construction mode.

S4 specifically comprises:

S41, inputting each element data of each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] selected on the time axis into the trained model to be evaluated for prediction, to obtain a prediction result for each element data of each time series.

Wherein the trained model to be evaluated has been trained in advance on the element data of the specified sequence L₀.

S42, obtaining the total error of each time series based on the prediction result of each element data in the time series and the actual dependent variable y.

S42 specifically comprises:

obtaining the total error of each time series using formula (3), based on the prediction result of each element data in the time series and the actual value of the dependent variable y.

Formula (3) is:

[formula not reproduced]

Wherein the quantities in formula (3) are, respectively:

the predicted value of the dependent variable y of an element data;

y, the dependent variable y in the element data;

m, the number of element data in the time series L_i;

e_i, the total error of the time series L_i.
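The total error e_i of S42 aggregates the prediction errors over the m element data of a time series. Since formula (3) is not available in this text, the sketch below assumes a mean absolute error; a squared-error or summed form would fit the surrounding definitions equally well. The function name is hypothetical.

```python
from typing import Sequence


def total_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed form of e_i: mean absolute error over the m element data."""
    m = len(y_true)
    if m == 0 or m != len(y_pred):
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - a) for a, p in zip(y_true, y_pred)) / m
```

For instance, `total_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])` evaluates to 0.5 under this assumption.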

S43, determining the K3 time series with the smallest total errors based on the total error of each time series.

Wherein K3 is a preset value and 0 ≤ K3 ≤ N/10.

S44, obtaining the third trend data set D3 based on the K3 time series with the smallest total errors.

S44 specifically comprises:

taking the timestamp of the last element data of each of the K3 time series with the smallest total errors as a starting point, obtaining the element data of the production operation time series data within the period T₀ backwards from it, and taking the union to obtain the third trend data set D3.

In this embodiment, historical operating states that approximate the current one are selected by means of the model to be evaluated; because the data set is selected using the errors obtained from that model, it better reflects the model's prediction capability in a real usage scenario.
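Steps S41 to S44 can be sketched end to end as: predict, score each time series, keep the K3 series with the smallest error, then gather a T₀ window at the last timestamp of each. Everything below is illustrative: the `(timestamp, x, y)` row layout, the `predict` callable standing in for the trained model, the mean absolute error and the `after` switch for the window direction are all assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[int, float, float]  # (timestamp, x, y); hypothetical single-input layout


def series_error(series: Sequence[Row], predict: Callable[[float], float]) -> float:
    """Assumed total error: mean absolute error of the model on one series."""
    return sum(abs(predict(x) - y) for _, x, y in series) / len(series)


def lowest_error_end_timestamps(series_list: Sequence[Sequence[Row]],
                                predict: Callable[[float], float],
                                k3: int) -> List[int]:
    """Last timestamps of the k3 series on which the model errs least."""
    ranked = sorted(series_list, key=lambda s: series_error(s, predict))
    return [s[-1][0] for s in ranked[:k3]]


def third_trend_set(stream: Sequence[Row], series_list: Sequence[Sequence[Row]],
                    predict: Callable[[float], float], k3: int, T0: int,
                    after: bool = True) -> List[Row]:
    """Union of the rows inside a T0-long window at each selected timestamp."""
    kept: Dict[int, Row] = {}
    for a in lowest_error_end_timestamps(series_list, predict, k3):
        lo, hi = (a, a + T0) if after else (a - T0, a)
        for row in stream:
            if lo <= row[0] <= hi:
                kept[row[0]] = row
    return [kept[ts] for ts in sorted(kept)]
```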

S5, sampling D1, D2 and D3 to obtain an evaluation data set for evaluating the model to be evaluated.

S5 specifically comprises:

S51, sampling the first trend data set D1, the second trend data set D2 and the third trend data set D3 according to a first proportion w₁, a second proportion w₂ and a third proportion w₃ respectively, to obtain a corresponding first trend sample set D1, second trend sample set D2 and third trend sample set D3.

S52, taking the union of the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3 and removing duplicates to obtain the final evaluation data set D.
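The sampling and merging of S51 and S52 can be sketched as follows. The text does not state how the sampling is done or what counts as a duplicate, so this sketch assumes uniform random sampling without replacement, a sample size of w times the set size truncated to an integer, and deduplication by timestamp. The row layout (timestamp first) and function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Row = Tuple  # any tuple whose first field is the timestamp; hypothetical layout


def sample_fraction(data: Sequence[Row], w: float, rng: random.Random) -> List[Row]:
    """Draw int(w * len(data)) rows uniformly without replacement."""
    return rng.sample(list(data), int(w * len(data)))


def build_evaluation_set(d1: Sequence[Row], d2: Sequence[Row], d3: Sequence[Row],
                         w1: float, w2: float, w3: float,
                         seed: int = 0) -> List[Row]:
    """Sample each trend set, merge, and drop rows with a repeated timestamp."""
    rng = random.Random(seed)
    merged: Dict[int, Row] = {}
    for data, w in ((d1, w1), (d2, w2), (d3, w3)):
        for row in sample_fraction(data, w, rng):
            merged[row[0]] = row
    return [merged[ts] for ts in sorted(merged)]
```

Fixing the seed makes the resulting evaluation set reproducible, which helps when the same model is re-evaluated after an update.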

In this embodiment, data sets with different characteristics (the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3) are all included in the evaluation data set, so that the evaluation data set covers both prediction capability and generalization capability, and the evaluation of the model to be evaluated is more objective and comprehensive.

In the method for constructing an evaluation data set for a flow-type industrial production data stream according to this embodiment, a first trend data set D1 is constructed by a distance similarity screening strategy, starting from the change pattern of the time series; the period T₃ of the sequence X is obtained by autocorrelation coefficient processing, and a second trend data set D2 is constructed from the period T₃; a third trend data set D3 is constructed by means of the trained model to be evaluated. Finally, these data sets with different characteristics are all included in the evaluation data set of the model to be evaluated, so that the constructed evaluation data set better reflects the prediction accuracy and generalization capability of the model at a specific time of use.

Finally, in the practical application of the embodiment, the final evaluation data set D is input into the model to be evaluated, so as to obtain a corresponding more accurate evaluation result.

Example two

In order to better understand the scheme of the embodiment of the present invention, the steps of the embodiment of the present invention are described in detail below.

The embodiment provides an evaluation data set construction method for a flow-type industrial production data stream, which comprises the following steps:

101. For the time series data of the production data stream, select on the time axis a time series list [L₁, L₂, ..., L_i, ..., L_N] and a specified sequence L₀. The time length of L₀ is the same as the update period T₀ of the model to be evaluated, and the specified sequence L₀ comprises the element data of the production data stream within the time period [t - T₀, t], where t is the current timestamp; that is, the specified sequence L₀ is the real operating data closest on the time axis to the model to be evaluated.

z_0w is the w-th element data, in chronological order, of the specified sequence L₀.

The time series L_i comprises: m element data, arranged in chronological order, of the time series data of the production data stream within a predetermined first time interval T₁ before the current timestamp t.

Wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im].

Wherein,

and F is a preset value.

For example, in a specific application of this embodiment, Table 1 shows the specified sequence L₀ corresponding to a model to be evaluated whose dependent variable is the main steam production; the independent variables are the amount of coal fed into the furnace, the total air volume into the furnace, and the oxygen content of the exhaust gas:

TABLE 1

Timestamp	Amount of coal fed into the furnace	Total air volume into the furnace	Oxygen content of exhaust gas	Main steam production
1647050100000	15.721	76.5558	2.1969	171.1258
1647050106000	15.5656	75.834	2.4192	171.2354
1647050112000	15.4694	76.1088	2.4407	171.2512
1647050118000	16.0389	76.8488	2.362	172.0235
1647050124000	16.027	76.0773	2.4075	172.1235
1647050130000	16.7334	76.1407	2.1873	172.3258
1647050136000	16.993	76.2791	2.1918	173.3984
1647050142000	17.3402	76.0581	2.2747	173.9423
1647050148000	17.3856	77.4178	2.5426	173.4521
1647050154000	17.6544	77.3188	2.3211	174.2144
1647050160000	17.0855	77.5035	2.2665	174.6845
1647050166000	17.6559	77.9596	2.3259	174.9545
1647050172000	17.7959	79.6381	2.2277	175.3632
1647050178000	18.0086	79.8114	2.2929	175.8852
1647050184000	18.1417	79.6916	2.3011	175.8567

102. For the specified sequence L₀ and a time series L_i, obtain the distance matrix D_(L0,Li) corresponding to the specified sequence L₀ and the time series L_i.

Wherein,

。

。

is the dependent variable y in the j-th element data, in chronological order, of the time series L_i.

103. Based on the distance matrix D_(L0,Li), use recursion formula (1) to recursively compute the minimum distance L_min(m, n) from element d₁₁ to element d_mn of the distance matrix D_(L0,Li), and take the minimum distance L_min(m, n) as the distance similarity between the specified sequence L₀ and the time series L_i.

The formula (1) is:

。

Wherein,

，

，

。
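Steps 102 and 103 can be illustrated with a minimal Python sketch. It rests on two assumptions that are not confirmed by the text above: the point-wise distance is taken to be the absolute difference of the dependent variable y, and the recursion is taken to be the standard dynamic time warping recurrence, in which each cell adds its own distance to the smallest of its three predecessors. The function names are hypothetical.

```python
from typing import List, Sequence


def distance_matrix(y0: Sequence[float], yi: Sequence[float]) -> List[List[float]]:
    """Pairwise distances between the dependent-variable values of L0 and L_i.

    Assumption: the absolute difference is used as the point-wise distance.
    """
    return [[abs(a - b) for b in yi] for a in y0]


def min_path_distance(d: List[List[float]]) -> float:
    """Minimum cumulative distance from the first to the last element of d.

    Assumption: the standard dynamic-time-warping recurrence
    L(w, j) = d[w][j] + min(L(w-1, j), L(w, j-1), L(w-1, j-1)).
    """
    rows, cols = len(d), len(d[0])
    inf = float("inf")
    acc = [[inf] * cols for _ in range(rows)]
    acc[0][0] = d[0][0]
    for w in range(rows):
        for j in range(cols):
            if w == 0 and j == 0:
                continue
            best_prev = min(
                acc[w - 1][j] if w > 0 else inf,
                acc[w][j - 1] if j > 0 else inf,
                acc[w - 1][j - 1] if w > 0 and j > 0 else inf,
            )
            acc[w][j] = d[w][j] + best_prev
    return acc[-1][-1]
```

Under these assumptions a smaller returned value means the two sequences are more alike; identical sequences give 0, and sequences of different lengths can still be compared.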

104. Based on the distance similarity between the specified sequence L₀ and each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N], obtain a similarity list.

105. Acquire a first timestamp set based on the similarity list.

106. Taking each timestamp in the first timestamp set as a starting point, acquire backward the element data of the production operation time series data within a period T₀, and take the union of the acquired element data to obtain a first trend data set D1.
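The window collection in step 106 can be sketched as follows. This is only an illustration under stated assumptions: each element data is represented as a dictionary with a `timestamp` key in milliseconds (a hypothetical layout), and "backward" from the starting point is read as the period T₀ that follows the starting timestamp on the time axis. If the opposite direction is intended, the comparison in the loop would be mirrored.

```python
from typing import Dict, Iterable, List

Record = Dict[str, float]  # hypothetical layout: one element data with a "timestamp" key (ms)


def collect_windows(records: Iterable[Record],
                    start_timestamps: Iterable[int],
                    period_ms: int) -> List[Record]:
    """Union of all records whose timestamp lies in [s, s + period_ms] for some start s.

    Assumption: the window extends from the starting point towards later timestamps.
    """
    starts = sorted(set(start_timestamps))
    selected = {}
    for rec in records:
        ts = rec["timestamp"]
        if any(s <= ts <= s + period_ms for s in starts):
            selected[ts] = rec  # keyed by timestamp, so overlapping windows merge once
    return [selected[ts] for ts in sorted(selected)]
```

Keying the result by timestamp means that element data covered by several overlapping windows appear only once in the resulting trend data set.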

107. Based on a sequence X of specified length and p preset translation time segments, obtain, for any preset translation time segment, the two sub data sets of the sequence X corresponding to that segment.

The sequence X of specified length comprises: h element data, arranged in chronological order, of the historical production operation time series data within a second time interval T₂.

X = [z₁, z₂, ..., z_r, ..., z_h].

z_r is the r-th element data, in chronological order, of the historical production operation time series data within the second time interval T₂.

Wherein 0 ≤ p ≤ 30.

Wherein t_g is the g-th of the p preset translation time segments.

The first sub data set of any preset translation time segment comprises: the element data of the sequence X that fall within that preset translation time segment, with the start time of the second time interval T₂ taken as the start time of the segment.

108. Using formula (2), obtain the autocorrelation coefficients of the two sub data sets of the sequence X corresponding to any preset translation time segment.

The formula (2) is:

。

Wherein the first time period T₃ is the time interval between two adjacent peaks of the first curve.
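The idea of reading a period off the autocorrelation curve can be sketched in Python. This is an illustrative sketch, not formula (2) itself: it assumes the standard sample autocorrelation of the dependent variable y at an integer lag (the lag standing in for a translation time segment), and it assumes the "first curve" is the autocorrelation coefficient plotted against the lag. The function names are hypothetical.

```python
from typing import List, Optional, Sequence


def autocorrelation(y: Sequence[float], lag: int) -> float:
    """Sample autocorrelation of y at the given lag (assumed standard definition)."""
    n = len(y)
    mean = sum(y) / n
    var = sum((v - mean) ** 2 for v in y)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((y[k] - mean) * (y[k + lag] - mean) for k in range(n - lag))
    return cov / var


def estimate_period(y: Sequence[float], max_lag: int) -> Optional[int]:
    """Lag distance between the first two local peaks of the autocorrelation curve.

    Returns None when fewer than two peaks are found within max_lag.
    """
    curve: List[float] = [autocorrelation(y, lag) for lag in range(max_lag + 1)]
    peaks = [k for k in range(1, max_lag)
             if curve[k] > curve[k - 1] and curve[k] >= curve[k + 1]]
    if len(peaks) < 2:
        return None
    return peaks[1] - peaks[0]
```

The result is expressed in number of samples; multiplying it by the sampling interval of the data stream would give a time period comparable to T₃.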

109. Starting from the current time t, acquire one element data every first time period T₃ and record the timestamp of that element data, obtaining a timestamp list A₂ with K2 timestamps.

Wherein,

110. Determine a second trend data set D2 based on the timestamp list A₂.
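The construction of the timestamp list in step 109 can be sketched as below. The sketch assumes that the K2 timestamps are counted back from the current timestamp t in steps of one period and that the first entry lies one period before t; neither detail is confirmed by the text above, and `timestamp_list` is a hypothetical name.

```python
from typing import List


def timestamp_list(t: int, period: int, k2: int) -> List[int]:
    """K2 timestamps spaced one period apart, counted back from the current timestamp t.

    Assumption: the list starts one period before t.
    """
    return [t - k * period for k in range(1, k2 + 1)]


# Illustrative call using the last timestamp of Table 1 and a made-up period of 24000 ms.
print(timestamp_list(1647050184000, 24000, 3))
# [1647050160000, 1647050136000, 1647050112000]
```

The period of 24000 ms is chosen only so that the resulting timestamps coincide with rows of Table 1; it is not a value taken from the embodiment.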

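Step 110 specifically includes: taking each timestamp in the timestamp list A₂ as a starting point, acquire backward the element data within a first time period T₃, and merge the acquired element data to obtain the second trend data set D2.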

111. Input each element data of each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] selected on the time axis into the trained model to be evaluated for prediction, obtaining a prediction result for each element data of each time series.

Wherein the trained model to be evaluated has been trained in advance on the specified sequence L₀.

112. Obtain the total error of each time series based on the prediction result of each element data in the time series and the actual dependent variable y.

Step 112 specifically includes: obtaining the total error of each time series using formula (3).

Formula (3) is:

。

is a predicted value of the dependent variable y of the element data.

y is the dependent variable y in the element data.

m is the number of element data in the time series L_i.

e_i is the total error of the time series L_i.

113. Determine the K3 time series with the smallest total errors based on the total error of each time series.

114. Obtain a third trend data set D3 based on the K3 time series with the smallest total errors.

Step 114 specifically includes:

taking the timestamp of the last element data of each of the K3 time series with the smallest total errors as a starting point, acquire backward the element data of the production operation time series data within a time period T₀, and merge the acquired element data to obtain the third trend data set D3.
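Steps 112 and 113 can be illustrated with a short Python sketch. Because formula (3) is not reproduced above, the sketch assumes that the total error e_i is the mean absolute difference between the predicted and actual dependent variable over the m element data of a time series; any other error measure could be substituted. Both function names are hypothetical.

```python
from typing import Dict, List, Sequence


def total_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Total error of one time series.

    Assumption: mean absolute error over its m element data.
    """
    m = len(y_true)
    return sum(abs(p - a) for p, a in zip(y_pred, y_true)) / m


def select_smallest_errors(errors: Dict[int, float], k3: int) -> List[int]:
    """Indices i of the k3 time series L_i with the smallest total error e_i."""
    # Ties are broken by the index so that the result is deterministic.
    return sorted(errors, key=lambda i: (errors[i], i))[:k3]
```

The last timestamps of the selected time series would then serve as starting points for collecting the third trend data set D3.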

115. Sample D1, D2 and D3 to obtain the evaluation data set used to evaluate the model to be evaluated.

Step 115 specifically includes:

sampling the first trend data set D1, the second trend data set D2 and the third trend data set D3 according to a first proportion w₁, a second proportion w₂ and a third proportion w₃, respectively, to obtain a first trend sample set D1, a second trend sample set D2 and a third trend sample set D3.

Merging the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3, and deduplicating the result to obtain the final evaluation data set D.
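The sampling and merging step can be sketched in Python as follows. This is a minimal sketch under assumptions not stated in the text: sampling is uniform without replacement, each proportion is a fraction between 0 and 1 of the corresponding trend data set, and two element data are considered duplicates when they share a timestamp. The record layout and function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence

Record = Dict[str, float]  # hypothetical layout: one element data with a "timestamp" key


def sample_fraction(data: Sequence[Record], proportion: float,
                    rng: random.Random) -> List[Record]:
    """Draw round(proportion * len(data)) records uniformly without replacement."""
    count = min(len(data), round(proportion * len(data)))
    return rng.sample(list(data), count)


def build_evaluation_set(d1: Sequence[Record], d2: Sequence[Record],
                         d3: Sequence[Record], w1: float, w2: float, w3: float,
                         seed: int = 0) -> List[Record]:
    """Sample D1, D2, D3 by w1, w2, w3, merge the samples and drop duplicates."""
    rng = random.Random(seed)
    merged = {}
    for data, w in ((d1, w1), (d2, w2), (d3, w3)):
        for rec in sample_fraction(data, w, rng):
            merged[rec["timestamp"]] = rec  # assumption: duplicates share a timestamp
    return [merged[ts] for ts in sorted(merged)]
```

Fixing the seed makes the sampled evaluation data set reproducible, which helps when the same model is evaluated repeatedly.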

As can be seen by comparing Fig. 2, Fig. 3 and Fig. 4, the error distribution of the model to be evaluated on the evaluation data set constructed by the existing hold-out method is significantly larger than both its error distribution on the time series data of the actual production data stream and its error distribution on the evaluation data set constructed by the method of this embodiment. The error distribution of the model to be evaluated on the evaluation data set constructed by the method of this embodiment lies within ±3 and has a mean of 0. This indicates that the evaluation data set constructed by the method of this embodiment is closer to the actual operating data, that is, it better reflects the prediction performance of the model to be evaluated in actual use.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions.

It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The terms first, second, third, etc. are used for convenience only and do not denote any order. These words are to be understood as part of the name of the component.

Furthermore, it should be noted that in the description of the present specification, the terms "one embodiment", "some embodiments", "examples", "specific examples" or "some examples", etc., mean that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, the claims should be construed to include preferred embodiments and all changes and modifications that fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention should also include such modifications and variations.

Claims

1. A method for constructing an evaluation data set for a flow-type industrial production data stream, characterized by comprising the following steps:

S1, for time series data of the production data stream, selecting on the time axis a time series list [L₁, L₂, ..., L_i, ..., L_N] and a specified sequence L₀, wherein the time length of L₀ is the same as the update period T₀ of the model to be evaluated, and the specified sequence L₀ comprises the element data of the production data stream within the time period [t - T₀, t], where t is the current timestamp;

S2, for the specified sequence L₀ and a time series L_i, obtaining the distance matrix D_(L0,Li) corresponding to the specified sequence L₀ and the time series L_i; based on the distance matrix D_(L0,Li), using a recursion formula to recursively compute the minimum distance L_min(m, n) from element d₁₁ to element d_mn of the distance matrix D_(L0,Li), and taking the minimum distance L_min(m, n) as the distance similarity between the specified sequence L₀ and the time series L_i; based on the distance similarity between the specified sequence L₀ and each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N], obtaining a similarity list of the time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] that correspond to the K1 maximum distance similarities; based on the similarity list, acquiring a first timestamp set comprising the timestamp of the last element data of each time series in the similarity list; taking each timestamp in the first timestamp set as a starting point, acquiring backward the element data of the production operation time series data within a period T₀, and merging the acquired element data to obtain a first trend data set D1; wherein K1 is a preset value, and 0 ≤ K1 ≤ N/10;

S3, based on a sequence X of specified length and p preset translation time segments, obtaining the two sub data sets of the sequence X corresponding to any preset translation time segment; obtaining the autocorrelation coefficients of the two sub data sets of the sequence X corresponding to any preset translation time segment; based on the autocorrelation coefficients of the two sub data sets of the sequence X corresponding to any preset translation time segment, determining a first time period T₃ corresponding to the sequence X; starting from the current time t, acquiring one element data every first time period T₃ and recording the timestamp of that element data, obtaining a timestamp list A₂ with K2 timestamps; taking each timestamp in the timestamp list A₂ as a starting point, acquiring backward the element data within a first time period T₃, and merging the acquired element data to obtain a second trend data set D2; wherein 0 ≤ K2 ≤ h/50, and h is the number of element data in the sequence X of specified length;

the sequence X of specified length comprises: h element data, arranged in chronological order, of the historical production operation time series data within a second time interval T₂; wherein the second time interval T₂ is greater than or equal to 15 days; X = [z₁, z₂, ..., z_r, ..., z_h]; z_r is the r-th element data, in chronological order, of the historical production operation time series data within the second time interval T₂; the p preset translation time segments are, in order: t₁, t₂, ..., t_g, ..., t_p; wherein 0 ≤ p ≤ 30; wherein t_g is the g-th of the p preset translation time segments; wherein the two sub data sets of the sequence X corresponding to any preset translation time segment are the first sub data set and the second sub data set of that preset translation time segment; the first sub data set of any preset translation time segment comprises: the element data of the sequence X that fall within that preset translation time segment, with the start time of the second time interval T₂ taken as the start time of the segment; the second sub data set of any preset translation time segment comprises: the element data of the sequence X other than those within that preset translation time segment, with the start time of the second time interval T₂ taken as the start time of the segment;

S4, inputting each element data of each time series in the time series list [L₁, L₂, ..., L_i, ..., L_N] selected on the time axis into a trained model to be evaluated for prediction, obtaining a prediction result for each element data of each time series; obtaining the total error of each time series based on the prediction result of each element data in the time series and the actual dependent variable y; determining the K3 time series with the smallest total errors based on the total error of each time series; taking the timestamp of the last element data of each of the K3 time series with the smallest total errors as a starting point, acquiring backward the element data of the production operation time series data within a time period T₀, and merging the acquired element data to obtain a third trend data set D3; wherein K3 is a preset value, and 0 ≤ K3 ≤ N/10;

S5, sampling the first trend data set D1, the second trend data set D2 and the third trend data set D3 according to a first proportion w₁, a second proportion w₂ and a third proportion w₃, respectively, to obtain a corresponding first trend sample set D1, second trend sample set D2 and third trend sample set D3; and merging the first trend sample set D1, the second trend sample set D2 and the third trend sample set D3 and deduplicating the result to obtain a final evaluation data set D.

2. The method of claim 1,

the specified sequence L₀ comprises: z₀₁, z₀₂, ..., z_0w, ..., z_0n;

z_0w is the w-th element data, in chronological order, of the specified sequence L₀;

wherein L_i = [z_i1, z_i2, ..., z_ij, ..., z_im];

z_ij is the j-th element data, in chronological order, of the time series L_i;

F is a preset value;

3. The method of claim 2,

wherein,

；

wherein,

；

is the dependent variable y in the w-th element data, in chronological order, of the specified sequence L₀;

4. The method of claim 3,

the recurrence formula is:

；

wherein L_min(w, j) is the minimum distance from element d₁₁ of the distance matrix to any element d_wj of the distance matrix;

；

；

。

5. The method of claim 4,

the formula for obtaining the autocorrelation coefficients of the two sub data sets of the sequence X of specified length corresponding to any preset translation time segment is:

；

is the dependent variable y in the r-th element data of the first sub data set of the sequence X of specified length corresponding to the preset translation time segment t_g;

6. The method of claim 5,

wherein the first time period T₃ is the time interval between two adjacent peaks of the first curve;

starting from the current time t, one element data is acquired every first time period T₃ and the timestamp of that element data is recorded, obtaining a timestamp list A₂ with K2 timestamps;

。

7. The method of claim 6,

the obtaining of the total error of each time series based on the prediction result of each element data in each time series and the actual dependent variable y specifically includes: acquiring the total error of each time sequence by adopting a formula (3);

the formula (3) is:

；

wherein,

is a predicted value of the dependent variable y of the element data;

y is a dependent variable y in the element data;

m is the number of element data in the time series L_i;

e_i is the total error of the time series L_i.

8. An evaluation data set construction system for a flow-type industrial production data stream, characterized by comprising:

at least one processor; and

at least one memory communicatively coupled to the processor, wherein the memory stores program instructions executable by the processor, and the processor invokes the program instructions to perform the evaluation data set construction method for a flow-type industrial production data stream according to any one of claims 1 to 7.