CN115718744B

CN115718744B - Data quality measurement method based on big data

Info

Publication number: CN115718744B
Application number: CN202211499047.8A
Authority: CN
Inventors: 杨道平; 胡礼波
Original assignee: Beijing Zhonghang Lutong Technology Co ltd
Current assignee: Beijing Zhonghang Lutong Technology Co ltd
Priority date: 2022-11-28
Filing date: 2022-11-28
Publication date: 2023-07-21
Anticipated expiration: 2042-11-28
Also published as: CN115718744A

Abstract

The invention relates to the field of digital twinning, in particular to a data quality measurement method based on big data, which comprises the following steps: twin data in an industrial data warehouse are acquired and data preprocessing is carried out; acquiring the data size difference corresponding to each time node, and carrying out time sequence uniform partitioning on twin big data according to the data size difference; acquiring data attributes in an industrial data warehouse, calculating attribute weights of all the attributes, acquiring partition quality labels of each time sequence uniform partition according to the attribute weights, and acquiring time sequence data quality curves of the whole data warehouse according to the partition quality labels; obtaining big data quality parameters according to the time sequence data quality curve; and finally, selecting data warehouse data in different fields in the optimal full life cycle corresponding to the digital twin body according to the large data quality parameters, and optimizing the digital twin system. Compared with the existing method for measuring the data quality, the method improves accuracy of twin big data quality measurement.

Description

Data quality measurement method based on big data

Technical Field

The invention relates to the field of digital twinning, in particular to a data quality measurement method based on big data.

Background

Along with the development of technology, the information age has come, and digital twin technology is vigorously developed under the promotion of technology, and is widely applied to various industries in the industrial field. The digital twin technology is a very important technology for the field of industrial manufacturing, and a digital twin model is established for corresponding industrial physical entities by using the digital twin technology, so that the process and the behavior of the entities in a physical field can be comprehensively described, mapped, monitored, diagnosed and optimized. And the establishment of the corresponding digital twin model requires the support by utilizing big data of multiple dimensions. The quality of big data is often important to the establishment of a digital twin model, under the condition that the quality of big data of a certain dimension is poor, the digital twin model established according to the big data is often inaccurate, certain deviation can be caused to the description, mapping, monitoring, diagnosis and optimization of the entity, the process and the behavior of a certain dimension in a physical field, and under the condition that the quality of the big data is good, the established digital twin model is often more accurate, the negative influence cannot occur, so that the high-quality big data is an important supporting point for the establishment of the digital twin system optimization and the digital twin model.

In the prior art, when the quality of big data is measured, the whole integrity and timeliness of the big data are often measured, and the measurement mode has certain advantages for the whole big data, but is often not objective and accurate enough when the digital twin system is optimized through twin big data. When the digital twin system is optimized later, a large data set which is unsuitable for selection is easily selected due to insufficient accuracy and objectivity of data quality, so that the digital twin system optimization and the digital twin model establishment are greatly influenced.

According to the invention, on the basis of carrying out time sequence data quantity uniform partition on the twin big data, digital twin attribute data in the big data are extracted, then quality measurement of the big data is carried out on the extracted attribute data in different partitions, then the quality measurement of the whole twin big data is realized through the quality of the big data of different partitions, finally the twin big data are selected according to measurement results, and twin big data with higher quality are selected to carry out digital twin system optimization and digital twin model establishment.

Disclosure of Invention

In order to solve the above problems, the present invention provides a data quality measurement method based on big data, the method comprising:

s1: acquiring digital twin data in an industrial data warehouse;

s2: respectively taking different time intervals as unit time nodes, obtaining the data volume difference corresponding to each time node according to the data volume difference of adjacent time, and obtaining time sequence uniform partitions according to the data volume difference;

s3: acquiring data attributes in a data warehouse, and calculating attribute weights of all the attributes;

s4: obtaining attribute volatility of each attribute relative to the previous day according to the occurrence number of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute;

s5, obtaining labels of data in each time sequence uniform partition according to the number of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute; obtaining a partition quality label of each time sequence uniform partition according to the label and attribute fluctuation of the data in each time sequence uniform partition, and obtaining a time sequence data quality curve of the whole data warehouse data according to the partition quality label;

s6: obtaining big data quality parameters according to the time sequence data quality curve;

s7: and selecting corresponding optimal data warehouse data according to the big data quality parameters to re-optimize the digital twin system and establishing a digital twin model.

Preferably, the step of obtaining the data size difference corresponding to each time node and obtaining the time sequence uniform partition according to the data size difference includes:

the size difference calculation formula of the data volume corresponding to the node by taking one day as a time node is as follows:

in the formula, h ₁ Representing the h in a time of day node ₁ Time node, where H ₁ The number of maximum nodes taking one day as one time node;represents the h ₁ The amount of data in the data warehouse in each time node; />Represents the h ₁ -the amount of data in the data warehouse in 1 time node;

data size difference D using two days as a time node ₂ And so on to obtain T E [1, Δt ]]And (2) andMAXT represents the number of days of existence of all data in the industrial data warehouse;

when T days are taken as a time node, the corresponding data size is different D _T At the minimum, every T days is one hourThe inter-nodes uniformly partition the data in all the data warehouses.

Preferably, the step of acquiring the data attribute in the data warehouse includes:

the attribute extraction of the data related to the digital twin is completed in the data warehouse by using a named body recognition technology, and the collection of the related attributes A related to the digital twin is obtained as follows:

A＝{A ₁ ,A ₂ …,A _b ,…,A _B }

wherein B represents the total number of the attributes extracted from the related digital twin in the data warehouse; a is that _b Representing the b-th attribute.

Preferably, the step of calculating attribute weights of all attributes includes:

for attribute b A _b Its attribute weight w _b Is attribute A _b The ratio of the number of occurrences in the data warehouse to the total number of attributes extracted from the digital twins in the data warehouse;

and calculating attribute weights of all the attributes.

Preferably, the step of obtaining the attribute volatility of each attribute with respect to the previous day according to the number of occurrences of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute includes:

the calculation formula of the attribute of the t day in the h time sequence uniform partition relative to the attribute fluctuation of the previous day is as follows:

wherein DeltaA _t Representing attribute volatility of the t-th day relative to the former day in the h-th time-series uniform partition; b' _t Attribute b' representing attribute appearing on the t-th day but not appearing on the t-1 th day in the h-th time-series uniform partition B' th attribute +_representing occurrence of the t-th day but not occurrence of the t-1 th day in the h-th timing uniform section>A corresponding number; w (w) _b For attribute b A _b Attribute weights of (a); b' _t Represents the b' th attribute A which does not appear on the t-th day but appears on the t-1 th day in the h time-series uniform partition _b” ；/>Represents the b' th attribute A which does not appear on the t-th day but appears on the t-1 th day in the h time-series uniform partition _b” Exp () represents an exponential function based on a natural constant.

Preferably, the step of obtaining the partition quality label of each time sequence uniform partition includes:

for the h time sequence uniform partition, the partition quality label calculation formula of the h time sequence uniform partition is as follows:

wherein C is _h The h time sequence is used for uniformly partitioning the partition quality labels; t represents the t-th day in the h-th time-series uniform partition; t represents the total number of days in the h time-series uniform partition; ΔA _t Representing attribute volatility of the t-th day relative to the former day in the h-th time-series uniform partition;the number of the b-th attribute showing the occurrence of the t-th day in the h-th time-series uniform partition is +.>w _b Representing attribute weight corresponding to the b-th attribute;

and calculating partition quality labels of all time sequence uniform partitions.

Preferably, the time sequence data quality curve of the whole data warehouse data is obtained according to a monotonically increasing model constructed by each time sequence uniform partition and the partition quality label thereof.

Preferably, the obtaining the big data quality parameter according to the time sequence data quality curve is obtained by integrating the time sequence data quality curve.

Preferably, the step of selecting the corresponding optimal data warehouse data according to the big data quality parameter to re-optimize the digital twin system and establish the digital twin model includes:

and calculating large data quality parameters in all relevant data warehouses in the whole life cycle generated by the digital twin system, selecting data warehouse data with the largest quality parameters to re-optimize the digital twin system, and establishing a digital twin model.

The embodiment of the invention has the following beneficial effects:

1. the benefit of this application over the prior art is: the quality of the entire data is measured not by using all the data but by using the characteristic attribute in the data, and the quality of the large data can be measured with a large amount of reduction in the calculation amount.

2. The benefit of this application over the prior art is: the method can utilize the related data generated in the operation process of the digital twin system to carry out data measurement, and has higher accuracy when the data measurement is compared with the prior art. The method is based on the partition time sequence calculation of the data, and can greatly weaken the influence of time on the data quality measurement, so that the subsequent digital twin system optimization provides high-quality data support.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions and advantages of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are only some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart illustrating a data quality measurement method based on big data according to an embodiment of the present invention.

Detailed Description

In order to further describe the technical means and effects adopted by the present invention to achieve the preset purpose, the following detailed description refers to the specific implementation, structure, characteristics and effects of a data quality measurement method based on big data according to the present invention with reference to the accompanying drawings and preferred embodiments. In the following description, different "one embodiment" or "another embodiment" means that the embodiments are not necessarily the same. Furthermore, the particular features, structures, or characteristics of one or more embodiments may be combined in any suitable manner.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

The following specifically describes a specific scheme of the data quality measurement method based on big data provided by the invention with reference to the accompanying drawings.

Referring to fig. 1, a flowchart illustrating a data quality measurement method based on big data according to an embodiment of the present invention is shown, the method includes the following steps:

s001, acquiring an industrial data warehouse related to the digital twin body, and preprocessing data in the data warehouse.

Because the quality measurement of big data is needed according to the data in the digital warehouse, the storage of the industrial data warehouse is needed according to the digital twin data generated by the data twin system, and the structuring processing is carried out through the storage data in the data warehouse, and the data structure needed in the embodiment is as follows: the statistical time + data entry of the data warehouse, for example, after the log information of the digital twin is imported into the data warehouse, the generation time of the log data is the statistical time, the specification of the vocabulary in the data warehouse is the data entry of the data warehouse, the log information of the digital twin comprises the log data generated by the digital twin system in the whole life cycle of industrial product design, process, production, operation and maintenance, and the like, and the digital twin data can be the data generated by any link in the whole life cycle, for example, the data generated by the industrial product design link is one of the digital twin data. The above structure is used for processing all acquired big data in the industrial data warehouse so as to measure the quality of the big data later, and the embodiment is implemented by taking one data warehouse as a research object.

And S002, acquiring the data volume difference corresponding to different time nodes, and carrying out time sequence uniform partition on the big data according to the data volume difference.

First, time sequence uniform partitioning of equal data quantity based on time sequence is carried out on big data in an industrial data warehouse, so that influence of the data quantity size on the big data in the industrial data warehouse is reduced.

Among quality metrics of large data, data size is an important influencing factor, and the influence of data sizes with different sizes on the measurement of large data is larger, and the larger the data size is, the larger the quality of large data is correspondingly spread, and the sizes of data sizes at different continuous time points in time sequence are different, so that the larger influence is caused when the quality metrics of data are carried out. The invention utilizes the data quantity in the time nodes which are uniformly distributed to partition the big data in the whole industrial data warehouse, so that the data quantity in each interval is based on the same size, and the time span length of each interval is equal, thereby reducing the influence on the data quality measurement caused by different data quantity in the follow-up. The data in the data warehouse is counted by taking the day as a unit, so the time node of the time sequence disclosed by the invention takes the day as a node, and the data of the data warehouse is partitioned by the time node which is uniformly distributed, and the specific process is as follows:

first, data size difference D of nodes is carried out by taking one day as a time node ₁ Is calculated by (1):

wherein: h is a ₁ Representing the h in a time of day node ₁ Time node H ₁ To get H in total when one day is taken as a time node ₁ And a time node.Represents the h ₁ The amount of data in the data warehouse in each time node,represents the h ₁ -the amount of data in the data warehouse in 1 time node.

Wherein D is ₁ The larger the data quantity in the data warehouse is quantified by taking one day as a time node, the larger the difference value of the data quantity is, the larger the influence is on the subsequent data quality measurement is, because the larger the data quantity is, the stronger the diversity of the attribute is, and the higher the frequency of single attribute is; when the data amount is too small, the weaker the diversity of the attributes, the lower the frequency of occurrence of a single attribute. If the difference of the data amount per day is larger, the frequency of occurrence of different attributes is different, the quality measurement for each interval will be inaccurate in the following, and vice versa.

D, taking two days as a time node to conduct data size difference of each node ₂ Is calculated by (1):

wherein: h is a ₂ Representing the h in a time node with two days as one time node ₂ Time node H ₂ To maximize the number of nodes with two days as one time node,represents the h ₂ Data amount in data warehouse in each time node, < >>Represents the h ₂ -the amount of data in the data warehouse in 1 time node.

D ₂ The larger the data amount in the data warehouse is quantified by taking 2 days as a time node, the larger the difference value of the data amount is, and the larger the influence is on the subsequent data quality measurement, and the opposite is.

In the same way, toCalculation D _Δt Obtain the set { D } ₁ ,D ₂ ,…,D _T ,…,D _Δt [1, Δt ]]And D is _T Is minimal.

To this end, the MAXT is divided into H sequential uniform partitions in units of T, and the sequential uniform partition refers to partitioning all data by taking the same number of days T as a time node, and after the data is partitioned by the time node, the data amount in each partition is basically equal, so the sequential uniform partition is called.

S003, acquiring data attributes in a data warehouse, calculating attribute weights of all the attributes, acquiring partition quality labels of each time sequence uniform partition according to the attribute weights, and acquiring time sequence data quality curves of the whole data warehouse data according to the partition quality labels.

The quality of big data is relatively speaking, so firstly, the digital twin volume data is extracted with the attribute, then the attribute weight is calculated by the attribute, the stability of the attribute in each time sequence uniform partition is calculated by the attribute weight and the occurrence frequency of the attribute, the quality label of the big data in each time sequence uniform partition is carried out by the stability of the attribute, the occurrence frequency of different attributes in the big data and the attribute weight corresponding to the attribute, then the quality curve of the big data in the industrial data warehouse is obtained by the quality label of the big data in each time sequence uniform partition, and finally the quality measurement of the big data is carried out by the quality curve, and the specific process is as follows:

(1) Data attributes in a data warehouse are obtained.

The attribute extraction process of the data refers to a process of extracting keywords from related log information related to a digital twin in the data of a data warehouse, and the specific process is as follows: manually marking the related log information of the related digital twin data, and then completing the data attribute extraction of the related digital twin in a data warehouse by using a named body recognition technology, wherein the method can be used for obtaining the set of the related attributes of the related digital twin as follows:

A＝{A ₁ ,A ₂ …,A _b ,…,A _B }

wherein B represents the total number of B and A of the extracted attributes of the related digital twin in the data warehouse _b Representing the b-th attribute.

(2) And calculating attribute weights of all the attributes.

All attributes of the related digital twin in the data warehouse are extracted, since the extracted attributes of the related digital twin are different in objectivity and accuracy in describing the physical entity, and different objectivity and accuracy are different in mass of big data in the data warehouse relative to the digital twin, the objectivity and accuracy of the digital twin data of different attributes need to be quantified, and the manner of quantifying the accuracy and objectivity of the different attributes is to calculate the attribute weight by using the different attributes, so that the b-th attribute A _b For example, its attribute weight w _b The calculation formula of (2) is as follows:

wherein: n is n _b For attribute b A _b The number of occurrences in the data warehouse,for the total number of occurrences of all attributes.

Wherein the quantification of the weight of the attribute in the data is performed in the above manner, in the sense that the more times the attribute appears, the more objective and more accurate the attribute is proved, namely w _b The larger the property A of the twins is _b The more objective and accurate the description of the relevant physical entity in the data warehouse, the higher the importance degree, and the larger the weight value correspondingly.

And calculating the attribute weights of all the extracted attributes by using the mode, so that B attribute weights corresponding to the B attributes can be obtained.

(3) And obtaining the partition quality label of each time sequence uniform partition according to the attribute weight.

After the attribute weights of all the attributes are obtained, since each interval in the time sequence uniform partition contains big data counted in different days, the attributes in each piece of data in the big data of each day are not necessarily the same, and the different attribute weights are different, the partition labels of each time sequence uniform partition are quantized for the attributes in all the big data in each time sequence uniform partition, taking the h time sequence uniform partition as an example, and the partition quality label calculation formula of the h time sequence uniform partition is as follows:

wherein C is _h The h time sequence is used for uniformly partitioning the partition quality labels; t represents the t-th day in the h-th time-sequence uniform partition, and t is [1, T ]]；ΔA _t Representing attribute volatility of the t-th day relative to the former day in the h-th time-series uniform partition;representing the number of b-th attributes appearing on the t-th day in the h time sequence uniform partition; w (w) _b Representing attribute weight corresponding to the b-th attribute;

wherein DeltaA _t The calculation formula of (2) is as follows:

wherein DeltaA _t Representing attribute volatility of the t-th day relative to the former day in the h-th time-series uniform partition; b' _t Represents the b' th attribute among attributes appearing on the t-th day but not appearing on the t-1 th day in the h-th time-series uniform partition B' th attribute +_representing occurrence of the t-th day but not occurrence of the t-1 th day in the h-th timing uniform section>A corresponding number; w (w) _b For attribute b A _b Attribute weights of (a); b' _t Represents the b' th attribute A which does not appear on the t-th day but appears on the t-1 th day in the h time-series uniform partition _b” ；/>Represents the b' th attribute A which does not appear on the t-th day but appears on the t-1 th day in the h time-series uniform partition _b” Is the number of (3);

ΔA in the formula _t The calculation process is calculated by the fluctuation of the attribute corresponding to the data in the h time sequence uniform partition, and the specific quantization mode is that the difference of the attribute appearing in the continuous days and the attribute weight and the number of the attribute with the difference are obtained, the larger the difference of the attribute appearing in the continuous days is, the stronger the fluctuation of the attribute corresponding to the data is, and the stronger the fluctuation of the attribute is, the poorer the quality of the data corresponding to the attribute in the continuous days is. Then, the negative exponential function of e is used for inverting, and when the fluctuation of the data is larger, the inversion is performedAnd then the weaker the stability of the attribute corresponding to the data is, the opposite is, and then the label of the data in the current partition is obtained by taking the product of the quantity of all the attribute occurrences in the s-th time sequence uniform partition and the corresponding weight, the more the quantity of the attribute in the partition is, the larger the corresponding weight is,the larger the data quality in the partition is, the better the product of the two parts is used as the label C of the partition _h ，C _h The larger the quality of the data within the partition, the better the quality, and vice versa.

The method is used for calculating the partition quality labels of all H time sequence uniform partitions, the larger the partition quality labels are, the more the corresponding attributes with larger attribute weights are contained in different big data in different days in the time sequence uniform partitions, and the more the attributes of different big data contained in different days appear stably, the better the quality of the big data in the time sequence uniform partitions is relatively, and the partition quality labels of all the time sequence uniform partitions are obtained.

(4) And acquiring a time sequence data quality curve of the whole data warehouse data according to the partition quality label.

After all the partition quality labels of the time sequence uniform partition are obtained, the time sequence data quality curve of the whole data warehouse data is calculated by using the partition quality labels of all the time sequence uniform partition, and the specific calculation formula is as follows:

wherein f (h) represents a time-series data quality curve of the whole data warehouse data, C _h Partition quality labels for h time sequence uniform partitions; h is the h time sequence uniform partition, and h is E [1, H]The method comprises the steps of carrying out a first treatment on the surface of the H is the total number of all sequential uniform partitions; e is a natural constant.

f (h) represents a time series data quality curve of the whole data warehouse data, and the time series data qualityThe curve is C for all time-sequential uniform partitions in data warehouse data corresponding to data in time-sequential uniform partitions for all large data on a time-based profitability basis _h As a result of curve fitting, the longer the entry time of the theoretical support is, the smaller the corresponding time sequence uniform partition h is, the lower the benefit of providing data assistance for the current relevant digital twin body, for example, the data about surface defects of the screw ten years ago, the lower the influence degree of the data is in the design of the screw in the same field today ten years later, and the lower the benefit from the data is.

In the case of each data in an industrial data warehouse, the different big data at different times are of different importance to the digital twin (i.e. the timeliness of the big data), in particular the weaker the time effect used by the user the less the benefit, but not the complete ineffectiveness, so the invention constrains it by a monotonic bounded increasing function over an infinite period of time, i.e. byConstraints are made such that the closer the generation time of the data is to the metric time, the higher the benefit thereof, the less close the benefit thereof is to the metric time. The quality curve of big data at different times is represented by its product with the quality label of the partition.

A data quality curve for all data warehouses associated with the digital twin is obtained using the above-described approach.

S004, obtaining large data quality parameters according to the time sequence data quality curve.

After the quality curves f (h) of all the big data in the industrial data warehouse are obtained, the big data quality curves are used for carrying out quality measurement of the big data, specifically, the big data quality parameters D are calculated by using the big data quality curves, and the big data quality parameters have the calculation formula:

wherein D is a big data quality parameter in data warehouse data; h is the h time sequence uniform partition; f (h) a time-series data quality curve of the whole data warehouse data; dh is a infinitesimal; h is the total number of all time-uniform partitions.

Formula logic: the profit curve of big data in the industrial data warehouse can not clearly and intuitively measure the quality of the big data, and the quality of the big data can be influenced when the statistics time of the data in the data warehouse is different, so the quality of the big data is measured by integrating the quality curve of the big data and then taking an average value, and the influence of the statistics time on the quality of the big data is smaller in a more intuitive angle. The larger the quality parameter D, the better the quality of the big data in the data warehouse, and vice versa.

S005, re-optimizing the digital twin system and establishing a digital twin model according to the data warehouse data with the maximum big data quality parameters.

And in the full life cycle of the digital twin system operation, after the big data quality parameters of all the data warehouse data are acquired, the data warehouse data with the largest big data quality parameters are selected to re-optimize the digital twin system and establish a digital twin model, wherein the re-optimization of the digital twin system can be realized by modeling and iterative upgrading of the digital twin body by utilizing the data warehouse data with the largest big data quality parameters and training and optimizing of an industry mechanism model in the digital twin system, and the optimization process is not a protection key point of the invention and is not described in detail.

In summary, the invention can perform quality measurement of big data by utilizing the characteristic attribute in the data, and can provide high-quality data support for optimizing the digital twin system under the condition of greatly reducing the calculation amount. The method is based on the data partition time sequence calculation, can greatly weaken the influence of time on the data quality measurement, and has higher accuracy.

It should be noted that: the sequence of the embodiments of the present invention is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments.

The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the scope of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims

1. A data quality metric method based on big data, the method comprising:

s1: acquiring digital twin data in an industrial data warehouse;

s5, obtaining labels of data in each time sequence uniform partition according to the number of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute; obtaining a partition quality label of each time sequence uniform partition according to the label and attribute fluctuation of the data in each time sequence uniform partition, and obtaining a time sequence data quality curve of the whole data warehouse according to the partition quality label;

s7: selecting corresponding optimal data warehouse data according to the big data quality parameters, re-optimizing the digital twin system and establishing a digital twin model;

the step of obtaining the data size difference corresponding to each time node and obtaining the time sequence uniform partition according to the data size difference comprises the following steps:

data size difference D using two days as a time node ₂ And so on to obtain T E [1, Δt ]]And (2) andMAXT represents the number of days of existence of all big data in the industrial data warehouse;

when T days are taken as a time node, the corresponding data size is different D _T And at the minimum, uniformly partitioning the data in all the data warehouses by taking every T days as a time node.

2. The big data based data quality metric method of claim 1, wherein the step of obtaining data attributes in the data warehouse comprises:

A＝{A ₁ ,A ₂ …,A _b ,…,A _B }

3. The big data based data quality metric method of claim 2, wherein the step of calculating the attribute weights of all attributes comprises:

and calculating attribute weights of all the attributes.

4. The method for measuring data quality based on big data according to claim 1, wherein the step of obtaining the attribute volatility of each attribute with respect to the previous day according to the number of occurrences of each attribute in each time sequence uniform partition and the attribute weight corresponding to each attribute comprises:

5. The method for measuring data quality based on big data according to claim 1, wherein the step of obtaining the partition quality label of each time-series uniform partition comprises:

6. The big data based data quality metric method of claim 1, wherein the time series data quality curve of the whole data warehouse is obtained according to a monotonically increasing model constructed by each time series uniform partition and its partition quality label.

7. The method of claim 1, wherein the obtaining the big data quality parameter from the time series data quality curve is obtained by integrating the time series data quality curve.

8. The method of claim 1, wherein the steps of selecting the corresponding best data warehouse data for re-optimizing the digital twinning system and establishing the digital twinning model according to the big data quality parameters comprise: