CN115718744A

CN115718744A - Data quality measurement method based on big data

Info

Publication number: CN115718744A
Application number: CN202211499047.8A
Authority: CN
Inventors: 杨道平; 胡礼波
Original assignee: Beijing Zhonghang Lutong Technology Co ltd
Current assignee: Beijing Zhonghang Lutong Technology Co ltd
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-02-28
Anticipated expiration: 2042-11-28
Also published as: CN115718744B

Abstract

The invention relates to the field of digital twinning, in particular to a data quality measurement method based on big data, which comprises the following steps: twin data in an industrial data warehouse are obtained and are subjected to data preprocessing; acquiring the difference of the data quantity corresponding to each time node, and performing time sequence uniform partitioning on twin big data according to the difference of the data quantity; acquiring data attributes in an industrial data warehouse, calculating attribute weights of all the attributes, acquiring a partition quality label of each time sequence uniform partition according to the attribute weights, and acquiring a time sequence data quality curve of the whole data warehouse according to the partition quality labels; obtaining a big data quality parameter according to the time sequence data quality curve; and finally, selecting data warehouse data in different fields in the optimal full life cycle corresponding to the digital twin according to the big data quality parameters, and optimizing the digital twin system. Compared with the existing method for measuring the data quality, the method improves the accuracy of measuring the quality of the twin big data.

Description

Data quality measurement method based on big data

Technical Field

The invention relates to the field of digital twinning, in particular to a data quality measurement method based on big data.

Background

With the development of science and technology, the information age has come, and the digital twin technology is vigorously developed under the promotion of science and technology and is widely applied to various industries in the industrial field. The digital twinning technology is an important technology for the field of industrial manufacturing, and the digital twinning technology is used for establishing a digital twinning model for a corresponding industrial physical entity, so that the process and the behavior of the entity in a physical field can be comprehensively described, mapped, monitored, diagnosed and optimized. And the establishment of the corresponding digital twin model needs to be supported by using large data with multiple dimensions. The quality of big data is often more important to the establishment of a digital twin model, the digital twin model established according to the big data is often not accurate enough under the condition that the quality of the big data of a certain dimensionality is poor, certain deviation can be caused to the description, mapping, monitoring, diagnosis and optimization of the certain dimensionality of an entity, a process and a behavior in a physical field, the digital twin model established under the condition that the quality of the big data is good is often more accurate, the negative influence can not occur, and therefore the big data with high quality is an important support point for the optimization of the digital twin system and the establishment of the digital twin model.

When the quality of big data is measured in the prior art means, the measurement is usually carried out through the integrity and timeliness of the whole big data, and the measurement mode has certain advantages for the whole big data, but is not objective and accurate when a digital twin system is optimized through the twin big data. When the digital twin system optimization is carried out subsequently, an inappropriate large data set is often selected easily because of insufficient data quality accuracy and objectivity, and great influence is caused on the digital twin system optimization and the digital twin model establishment.

According to the method, digital twin biological attribute data in the big data are extracted on the basis of carrying out uniform partitioning on the time sequence data quantity of the twin big data, the quality of the big data is measured in different partitions by the extracted attribute data, the quality of the whole twin big data is measured by the qualities of the big data in different partitions, the twin big data is selected according to a measurement result, and the twin big data with higher quality is selected to carry out digital twin system optimization and establishment of a digital twin model.

Disclosure of Invention

In order to solve the above problem, the present invention provides a data quality measurement method based on big data, the method comprising:

s1: acquiring digital twin data in an industrial data warehouse;

s2: respectively taking different time intervals as unit time nodes, obtaining the difference of the data quantity corresponding to each time node according to the data quantity difference of adjacent time, and obtaining time sequence uniform partitions according to the difference of the data quantity;

s3: acquiring data attributes in a data warehouse, and calculating attribute weights of all the attributes;

s4: obtaining attribute volatility of each attribute relative to the previous day according to the number of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute;

s5, obtaining a label of the data in each time sequence uniform partition according to the number of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute; obtaining a partition quality label of each time sequence uniform partition according to the label and attribute fluctuation of the data in each time sequence uniform partition, and obtaining a time sequence data quality curve of the data of the whole data warehouse according to the partition quality label;

s6: obtaining a big data quality parameter according to the time sequence data quality curve;

s7: and selecting corresponding optimal data warehouse data according to the big data quality parameters to re-optimize the digital twin system and establish a digital twin model.

Preferably, the step of obtaining the difference of the data size corresponding to each time node and obtaining the time-sequence uniform partition according to the difference of the data size includes:

the difference calculation formula of the corresponding data quantity of the node by taking one day as a time node is as follows:

in the formula, h ₁ Indicating that the day is the h-th in a time node ₁ A time node, whichMiddle H ₁ The number of the maximum nodes taking one day as a time node;

denotes the h th ₁ The amount of data in the data warehouse in each time node;

denotes the h-th ₁ -the amount of data in the data warehouse in 1 time node;

data volume size difference D with two days as one time node ₂ And the analogy is carried out to obtain the T epsilon [1, delta T ∈]And is and

MAXT represents the number of days of existence of all data in the industrial data warehouse;

when T days are taken as a time node, the corresponding data volume difference D _T And when the minimum time is less than the preset time, uniformly partitioning the data in all the data warehouses by taking every T days as a time node.

Preferably, the step of obtaining data attributes in the data warehouse comprises:

the attribute extraction of data related to the digital twin is completed in a data warehouse by using a named body identification technology, and the collection of related attributes A related to the digital twin is obtained as follows:

A＝{A ₁ ,A ₂ …,A _b ,…,A _B }

in the formula, B represents the total number of the attributes extracted from the related digital twin in the data warehouse; a. The _b Representing the b-th attribute.

Preferably, the step of calculating the attribute weights of all the attributes includes:

for the b-th attribute A _b Its attribute weight w _b Is attribute A _b The ratio of the number of times of occurrence in the data warehouse to the total number of attributes extracted by the digital twins in the data warehouse;

and calculating attribute weights of all attributes.

Preferably, the step of obtaining attribute volatility of each attribute relative to the previous day according to the number of occurrences of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute includes:

the calculation formula of the attribute fluctuation of the t day in the h time sequence uniform partition relative to the attribute fluctuation of the previous day is as follows:

in the formula,. DELTA.A _t Representing attribute volatility of the t day in the h time-ordered uniform partition relative to the attribute of the previous day; b' _t B' th attribute representing an attribute that appears at day t in the h-th time-sequential uniform partition but does not appear at day t-1

B' th attribute representing occurrence on day t but no occurrence on day t-1 in the h time-sequential uniform partition

The corresponding number; w is a _b Is the b-th attribute A _b The attribute weight of (2); b' _t B "th attribute A, which indicates that it does not appear on day t but appears on day t-1 in the h time-sequential uniform partition _b” ；

B "th attribute A, which indicates that it does not occur in the t-th day but occurs in the t-1 th day in the h-th time-series uniform partition _b” Exp () represents an exponential function with a natural constant as the base.

Preferably, the step of obtaining the partition quality label of each time-sequence uniform partition includes:

for the h time sequence uniform partition, the partition quality label calculation formula of the h time sequence uniform partition is as follows:

in the formula, C _h The h time sequence is divided into the partition quality labels of the uniform partitions; t represents the t day in the h time-sequential uniform partition; t represents the total number of days in the h time-ordered uniform partition; delta A _t Representing attribute volatility of the t day in the h time-ordered uniform partition relative to the attribute of the previous day;

the number of the b-th attribute appearing at the t day in the h time sequence uniform partition is represented as

w _b Representing the attribute weight corresponding to the b-th attribute;

and calculating the partition quality labels of all time sequence uniform partitions.

Preferably, the time series data quality curve of the whole data warehouse data is obtained according to a monotone increasing model constructed by each time series uniform partition and the partition quality labels thereof.

Preferably, the obtaining of the large data quality parameter according to the time series data quality curve is obtained by integrating the time series data quality curve.

Preferably, the step of selecting the corresponding optimal data warehouse data according to the big data quality parameter to re-optimize the digital twin system and the step of establishing the digital twin model include:

calculating large data quality parameters in all data warehouses related to the whole life cycle generated by the digital twin system, and selecting the data warehouse data with the maximum quality parameter to re-optimize the digital twin system and establish a digital twin model.

The embodiment of the invention has the following beneficial effects:

1. the benefits of the present application over the prior art are: the quality of the entire data is measured not by all data but by the characteristic attributes of the data, and the quality of the large data can be measured with a large amount of calculation.

2. The benefits of the present application over the prior art are: the data measurement can be carried out by utilizing the related data generated in the running process of the digital twin system, and the accuracy is higher in the data measurement compared with the prior art. And the method is based on the partitioned time-series calculation of the data, and can greatly weaken the influence of time on the data quality measurement, so that high-quality data support is provided for the subsequent optimization of a digital twin system.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions and advantages of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart illustrating steps of a method for big data based data quality measurement according to an embodiment of the present invention.

Detailed Description

To further illustrate the technical means and effects of the present invention adopted to achieve the predetermined objects, the following detailed description, structures, features and effects of a big data based data quality measurement method according to the present invention are provided with the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" refers to not necessarily the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The following describes a specific scheme of the big data based data quality measurement method provided by the present invention in detail with reference to the accompanying drawings.

Referring to fig. 1, a flowchart illustrating steps of a big data based data quality measurement method according to an embodiment of the present invention is shown, where the method includes the following steps:

and S001, acquiring an industrial data warehouse related to the digital twin, and preprocessing data in the data warehouse.

Since the quality measurement of the big data needs to be performed according to the data in the digital warehouse, the storage of the industrial data warehouse needs to be performed according to the digital twin data generated by the data twin system, and the structuring processing is performed through the stored data in the data warehouse, where the data structure needed in this embodiment is as follows: the statistical time + the data warehouse data entry, for example, after the log information of the digital twin is imported into the data warehouse, the generation time of the log data is the statistical time, the specification of a vocabulary in the data warehouse is the data warehouse data entry, the log information of the digital twin includes log data generated by a digital twin system in a full life cycle of industrial product design, process, production, operation and maintenance, the digital twin data can be data generated in any link in the full life cycle, for example, the data generated in the industrial product design link is one of the digital twin data. The above structure is used to process the big data in all the obtained industrial data warehouses so as to measure the quality of the big data in the following steps, and this embodiment is implemented by using one data warehouse as a research object.

And S002, acquiring data volume size difference corresponding to different time nodes, and performing time sequence uniform partitioning on the big data according to the data volume size difference.

The method comprises the steps of firstly carrying out time sequence uniform partitioning on large data in the industrial data warehouse based on time sequence and equal data quantity to reduce the influence of the data quantity on the large data in the industrial data warehouse.

The data volume is an important influence factor in the quality measurement of the large data, and the data volumes with different sizes have a larger influence on the measurement of the large data, which is particularly shown in that the larger the data volume is, the larger the variation degree of the quality of the large data is, and the sizes of the data volumes at different continuous time points in time sequence are different, so that the quality measurement of the data has a larger influence. Therefore, the invention partitions the big data in the whole industrial data warehouse by using the data volume in the uniformly distributed time nodes, so that the data volume in each interval is equal, and the time span length of each interval is equal, thereby reducing the influence on the data quality measurement caused by different data volumes in the follow-up process. The data in the data warehouse is counted in units of days, so the time nodes of the time sequence of the invention take the days as the nodes, and the data in the data warehouse is evenly distributed by the data partitions of the time nodes, and the specific process is as follows:

firstly, taking one day as a time node to perform data volume difference D of the node ₁ The calculation of (2):

in the formula: h is ₁ Indicating that the day is the h-th in a time node ₁ A time node, H ₁ To obtain H in total when one day is used as a time node ₁ And (5) a time node.

Denotes the h th ₁ The amount of data in the data warehouse in each time node,

denotes the h th ₁ The amount of data in the data warehouse in 1 time node.

Wherein D is ₁ The larger the data quantity is, the larger the difference value of the data quantity is, the larger the influence is on the subsequent data quality measurement when the data quantity is quantified by taking one day as a time node, because the data quantity is too largeIn time, the stronger the diversity of the attributes is, the higher the frequency of occurrence of a single attribute is; when the data amount is too small, the diversity of attributes is weaker, and the frequency of occurrence of a single attribute is lower. If the difference of the data amount of each day is larger, the occurrence frequency of different attributes is different, and the subsequent quality measurement for each interval is inaccurate, and vice versa.

Then, taking two days as a time node to perform D of difference of data volume of each node ₂ The calculation of (2):

in the formula: h is ₂ Representing two days as the h-th in one time node ₂ A time node, H ₂ The maximum number of nodes with two days as a time node,

denotes the h th ₂ The amount of data in the data warehouse in each time node,

denotes the h th ₂ The amount of data in the data warehouse in 1 time node.

D ₂ The larger the difference value of the data quantity is, the larger the influence on the subsequent data quality measurement is, the larger the difference value is, the opposite is, when the data quantity in the data warehouse is quantified by taking 2 days as a time node.

In the same way, with

Calculating D _Δt And the acquisition set is { D ₁ ,D ₂ ,…,D _T ,…,D _Δt In which T ∈ [1, Δ T ]]And D is _T Is minimal.

In this case, the MAXT is continuously equally divided into H time-series uniform partitions in units of T, and a time-series uniform partition is referred to as a time-series uniform partition because all data are partitioned by using the same number of days T as time nodes and the data are partitioned by the time nodes so that the data amount in each partition is substantially equal.

And S003, acquiring data attributes in the data warehouse, calculating attribute weights of all the attributes, acquiring a partition quality label of each time sequence uniform partition according to the attribute weights, and acquiring a time sequence data quality curve of the data of the whole data warehouse according to the partition quality labels.

The quality of the big data is relatively speaking, so that firstly, the attribute of the digital twin biological data is extracted, then the attribute weight is used for calculating the attribute weight, the stability of the attribute in each time sequence uniform partition is calculated by using the attribute weight and the occurrence frequency of the attribute, the quality label of the big data in each time sequence uniform partition is carried out by using the stability of the attribute, the occurrence frequency of different attributes in the big data and the attribute weight corresponding to the attribute, then the quality curve of the big data in the industrial data warehouse is obtained by using the quality label of the big data in each time sequence uniform partition, and finally the quality measurement of the big data is carried out by using the quality curve, wherein the specific process is as follows:

(1) Data attributes in a data warehouse are obtained.

The data attribute extraction process is a process of extracting keywords from the data of the data warehouse for the related log information related to the digital twin, and the specific process is as follows: manually labeling related log information related to digital twin data, and then completing data attribute extraction related to the digital twin in a data warehouse by using a named body recognition technology, wherein a set of related attributes related to the digital twin can be obtained by using the method as follows:

A＝{A ₁ ,A ₂ …,A _b ,…,A _B }

wherein B represents the total number of extracted attributes of the digital twins in the data warehouse, and A represents the total number of extracted attributes of the digital twins in the data warehouse _b Representing the b-th attribute.

(2) And calculating attribute weights of all attributes.

All pairs in the extracted data warehouse are relatedThe objectivity and accuracy of the digital twin body data are different in the process of describing the physical entity by the extracted attributes of the digital twin body, and the quality of the big data in the data warehouse is different from that of the digital twin body, so that the objectivity and accuracy of the digital twin body data with different attributes need to be quantified _b For example, its attribute weight w _b The calculation formula of (2) is as follows:

in the formula: n is _b Is the b-th attribute A _b The number of times that the data is present in the data warehouse,

the total number of occurrences for all attributes.

The significance of quantifying the weight of the attribute in the data in the above manner is that the more times the attribute appears, the more objective and accurate the attribute is proved, namely w _b The greater the time, the greater the attribute A of the twin is _b The more objective and accurate the description of the relevant physical entities in the data warehouse, the higher the importance degree of the description, and the larger the weight value.

The above method is used to calculate the attribute weight of all the extracted attributes, and B attribute weights corresponding to the B attributes can be obtained.

(3) And obtaining the partition quality label of each time sequence uniform partition according to the attribute weight.

After obtaining the attribute weights of all attributes, because each interval in the time sequence uniform partition contains big data counted in different days, the attributes in each piece of data in the big data of each day are not necessarily the same, and the different attribute weights are different, the partition label of each time sequence uniform partition is quantized for the attributes in all the big data in each time sequence uniform partition, taking the h time sequence uniform partition as an example, the partition quality label calculation formula of the h time sequence uniform partition is as follows:

in the formula, C _h The h time sequence is divided into the partition quality labels of the uniform partitions; t denotes the t day in the h time-ordered uniform partition, and t e [1,T]；ΔA _t Representing attribute volatility of the t day in the h time-ordered uniform partition relative to the attribute of the previous day;

representing the number of the b-th attribute appearing at the t day in the h time sequence uniform partition; w is a _b Representing the attribute weight corresponding to the b-th attribute;

wherein, delta A _t The calculation formula of (2) is as follows:

in the formula,. DELTA.A _t Representing attribute volatility of the t day in the h time-series uniform partition relative to the previous day; b' _t B' th attribute among attributes representing attributes appearing at day t in the h-th time-series uniform partition but not appearing at day t-1

Represents the b' th attribute that appears at day t but does not appear at day t-1 in the h time-ordered uniform partition

The corresponding number; w is a _b Is the b-th attribute A _b The attribute weight of (2); b' _t B "th attribute A, which indicates that it does not occur in the t-th day but occurs in the t-1 th day in the h-th time-series uniform partition _b” ；

B "th attribute A, which indicates that it does not occur in the t-th day but occurs in the t-1 th day in the h-th time-series uniform partition _b” The number of (2);

in the formula,. DELTA.A _t The calculation process is calculated according to the fluctuation of the attribute corresponding to the data in the h-th time sequence uniform partition, the specific quantification mode is obtained by the difference of the attributes appearing in the continuous day and the attribute weight and the number of the attributes with the difference, the larger the difference of the attributes appearing in the continuous day is, the stronger the fluctuation of the attribute corresponding to the data is, and the stronger the fluctuation of the attribute is, the poorer the quality of the data corresponding to the attribute in the day is. Then, negating by using a negative exponential function of e, when the fluctuation of the data is larger, the stability of the attribute corresponding to the data after negating is weaker, otherwise, the data is negated, then, the number of the attributes appearing in the time sequence uniform partition is multiplied by the corresponding weight value to serve as the label of the data in the current partition, the number of the attributes in the partition is larger, the corresponding weight value is larger,

the larger the size, the better the quality of the data in the partition, so the product of the two parts is used as the label C of the partition _h ，C _h The larger the size, the better the quality of the data in the partition, and vice versa.

The partition quality labels are calculated for all the H time sequence uniform partitions by using the above mode, the larger the partition quality label is, the more attributes with larger attribute weights are contained in different big data in different days in the corresponding time sequence uniform partition are, and the more stable the attributes of the different big data contained in different days appear, the better the quality of the big data in the time sequence uniform partition is relatively, and the partition quality labels of all the time sequence uniform partitions are obtained.

(4) And acquiring a time sequence data quality curve of the whole data warehouse data according to the partition quality label.

After the partition quality labels of all time sequence uniform partitions are obtained, calculating a time sequence data quality curve of the whole data warehouse data by using the partition quality labels of all time sequence uniform partitions, wherein the specific calculation formula is as follows:

wherein f (h) represents a time series data quality curve of the data of the whole data warehouse, C _h A partition quality label for h time sequence uniform partition; h is the h-th time sequence uniform partition, and h is the [1,H ]](ii) a H is the total number of all time sequence uniform partitions; e is a natural constant.

f (h) represents a time series data quality curve for the entire data warehouse data, the time series data quality curve being C for all time series uniform partitions in the data warehouse data corresponding to data in the time series uniform partitions for all big data on a time-based profitability basis _h As a result of curve fitting, the theoretical support is that the longer the time for inputting large data is, the smaller the corresponding time-sequence uniform partition h is, and the lower the profitability for providing data assistance for the digital twin, such as data about surface defects of a ten-year-old screw, and the lower the influence degree of the data in the design of the ten-year-old screw in the same field, the lower the profit from the data.

Different big data at different times is of different importance to the digital twin for each data in the industrial data warehouse (i.e. timeliness of big data), in particular the less close the time used by the user, the weaker the effect, the lower the benefit, but not completely invalid, so the invention constrains it by a monotonically bounded increasing function over infinite time, i.e. by a monotonically bounded function over infinite time

The constraints are made such that the closer the generation time of the data is to the metric time, the higher the gain, and the closer the generation time is to the metric time, the lower the gain. The quality curve of large data at different times is represented by its product with the quality label of the partition.

Data quality curves of the digital twin with respect to all data warehouses are obtained using the above described approach.

And S004, obtaining a large data quality parameter according to the time sequence data quality curve.

After the quality curves f (h) of all big data in the industrial data warehouse are obtained, the big data quality curves are used for carrying out quality measurement on the big data, specifically, the big data quality parameters D are calculated by using the big data quality curves, and the big data quality parameter calculation formula is as follows:

in the formula, D is a big data quality parameter in data warehouse data; h is the h time sequence uniform partition; f (h) time series data quality curve of the whole data warehouse data; dh is infinitesimal; h is the total number of uniform partitions in all cases.

Formula logic: the yield curve of the big data in the industrial data warehouse can not be clearly and intuitively measured on the quality of the big data, the data statistics time in the data warehouse is different, and the quality of the big data can be influenced during quantification, so that the quality of the big data is measured by integrating the quality curve of the big data and then calculating the average value, and the influence of the statistics time on the quality of the big data is smaller in a more intuitive angle. The larger the quality parameter D, the better the quality of the big data in the data warehouse, and vice versa.

And S005, re-optimizing the digital twin system and establishing a digital twin model according to the data warehouse data with the maximum large data quality parameter.

In the full life cycle of the digital twin system, after the large data quality parameters of all data warehouse data are obtained, the data warehouse data with the largest large data quality parameter are selected to carry out re-optimization on the digital twin system and establishment of a digital twin model, the re-optimization of the digital twin system can be modeling iterative upgrade on the digital twin body by using the data warehouse data with the largest large data quality parameter and can be used for training and optimizing an industry mechanism model in the digital twin system, the optimization process is not the protection focus of the invention and is not described in detail.

In conclusion, the quality measurement of the big data is carried out by utilizing the characteristic attributes in the data, so that the quality measurement of the big data can be carried out under the condition of greatly reducing the calculation amount, and high-quality data support is provided for the optimization of the digital twin system. The method is based on the partitioned time sequence type calculation of the data, can greatly weaken the influence of time on the data quality measurement, and is higher in accuracy.

It should be noted that: the precedence order of the above embodiments of the present invention is only for description, and does not represent the merits of the embodiments. And specific embodiments thereof have been described above. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments.

The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; the modifications or substitutions do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present application, and are included in the protection scope of the present application.

Claims

1. A big data based data quality metric method, the method comprising:

s1: acquiring digital twin data in an industrial data warehouse;

s2: respectively taking different time intervals as unit time nodes, obtaining the data volume size difference corresponding to each time node according to the data volume difference of adjacent time, and obtaining time sequence uniform partitions according to the data volume size difference;

2. The big-data-based data quality measurement method according to claim 1, wherein the step of obtaining the data size difference corresponding to each time node comprises:

in the formula, h ₁ Indicating that the day is the h-th in a time node ₁ A time node, wherein H ₁ The number of the maximum nodes taking one day as a time node;

denotes the h th ₁ -the amount of data in the data warehouse in 1 time node;

3. The big-data-based data quality metric method according to claim 1, wherein the step of obtaining data attributes in a data warehouse comprises:

A＝{A ₁ ,A ₂ …,A _b ,…,A _B }

in the formula, B represents the total number of the attributes extracted by the relevant digital twin in the data warehouse; a. The _b Representing the b-th attribute.

4. The big-data-based data quality measurement method according to claim 3, wherein the step of calculating the attribute weights of all the attributes comprises:

and calculating attribute weights of all attributes.

5. The big-data-based data quality measurement method according to claim 1, wherein the step of obtaining attribute volatility of each attribute with respect to a previous day according to the number of occurrences of each attribute in each time-sequential uniform partition and the attribute weight corresponding to each attribute comprises:

Indicating that the ith time-ordered uniform partition occurred on day t but was inThe b' th attribute not appearing on day t-1

6. The big-data-based data quality metric method according to claim 1, wherein the obtaining of the partition quality label of each time-sequential uniform partition comprises:

in the formula, C _h The h time sequence is divided into the partition quality labels of the uniform partitions; t represents the t day in the h time-series uniform partition; t represents the total number of days in the h time-ordered uniform partition; delta A _t Representing attribute volatility of the t day in the h time-series uniform partition relative to the previous day;

w _b Representing the attribute weight corresponding to the b-th attribute;

7. The big-data-based data quality metric method as claimed in claim 1, wherein the time-series data quality curve of the whole data warehouse data is obtained according to a monotone increasing model constructed by each time-series uniform partition and the partition quality label thereof.

8. The big-data-based data quality metric method of claim 1, wherein the big-data quality parameter obtained from the time-series data quality curve is obtained by integrating the time-series data quality curve.

9. The big-data-based data quality metric method as claimed in claim 1, wherein the step of re-optimizing the digital twin system and establishing the digital twin model by selecting the corresponding best data warehouse data according to the big-data quality parameter comprises: