CN115718744A - Data quality measurement method based on big data

Data quality measurement method based on big data

Info

Publication number
CN115718744A
Authority
CN
China
Prior art keywords
data, attribute, time, quality, partition
Prior art date
Legal status: Granted
Application number
CN202211499047.8A
Other languages
Chinese (zh)
Other versions
CN115718744B (en)
Inventor
杨道平
胡礼波
Current Assignee
Beijing Zhonghang Lutong Technology Co., Ltd.
Original Assignee
Beijing Zhonghang Lutong Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Beijing Zhonghang Lutong Technology Co., Ltd.
Priority to CN202211499047.8A
Publication of CN115718744A
Application granted
Publication of CN115718744B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P 90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30: Computing systems specially adapted for manufacturing


Abstract

The invention relates to the field of digital twins, and in particular to a data quality measurement method based on big data, comprising the following steps: obtaining twin data from an industrial data warehouse and preprocessing the data; obtaining the data-volume difference corresponding to each candidate time node and partitioning the twin big data into time-sequential uniform partitions according to that difference; obtaining the data attributes in the industrial data warehouse, calculating the attribute weights of all attributes, obtaining a partition quality label for each time-sequential uniform partition from the attribute weights, and obtaining a time-series data quality curve of the whole data warehouse from the partition quality labels; obtaining a big data quality parameter from the time-series data quality curve; and finally selecting, according to the big data quality parameters, the data warehouse data of different fields within the optimal full life cycle corresponding to the digital twin, and optimizing the digital twin system. Compared with existing data quality measurement methods, the method improves the accuracy with which the quality of twin big data is measured.

Description

Data quality measurement method based on big data
Technical Field
The invention relates to the field of digital twinning, in particular to a data quality measurement method based on big data.
Background
With the development of science and technology the information age has arrived, and digital twin technology, driven by this progress, is developing vigorously and is widely applied across industries in the industrial field. Digital twin technology is an important technology for industrial manufacturing: a digital twin model is established for a corresponding industrial physical entity, so that the processes and behaviors of the entity in the physical domain can be comprehensively described, mapped, monitored, diagnosed and optimized. Establishing such a digital twin model requires the support of multi-dimensional big data. The quality of the big data is therefore critical to the establishment of the digital twin model: when the big data of a certain dimension is of poor quality, the resulting digital twin model is often not accurate enough, and a certain deviation is introduced into the description, mapping, monitoring, diagnosis and optimization of that dimension of the entity, process and behavior in the physical domain; a digital twin model established on good-quality big data is more accurate and avoids such negative effects. High-quality big data is therefore an important support for optimizing the digital twin system and establishing the digital twin model.
When the quality of big data is measured by prior-art means, the measurement is usually carried out through the integrity and timeliness of the big data as a whole. This approach has certain advantages for big data in general, but it is neither objective nor accurate when a digital twin system is to be optimized with twin big data. In subsequent digital twin system optimization, an inappropriate big data set is easily selected because the data quality assessment is insufficiently accurate and objective, which strongly affects the optimization of the digital twin system and the establishment of the digital twin model.
In the present method, on the basis of uniformly partitioning the twin big data by time-sequential data volume, the digital twin attribute data in the big data are extracted; the quality of the big data is measured within each partition using the extracted attribute data; the quality of the whole twin big data is measured from the per-partition qualities; and the twin big data are then selected according to the measurement result, so that twin big data of higher quality are chosen for optimizing the digital twin system and establishing the digital twin model.
Disclosure of Invention
In order to solve the above problem, the present invention provides a data quality measurement method based on big data, the method comprising:
S1: acquiring digital twin data in an industrial data warehouse;
S2: taking different time intervals as unit time nodes, obtaining the data-volume difference corresponding to each candidate time node from the data-volume differences of adjacent times, and obtaining time-sequential uniform partitions according to the data-volume difference;
S3: acquiring the data attributes in the data warehouse and calculating the attribute weights of all attributes;
S4: obtaining the attribute volatility of each day relative to the previous day according to the number of occurrences of each attribute in each time-sequential uniform partition and the attribute weight corresponding to each attribute;
S5: obtaining a label for the data in each time-sequential uniform partition according to the number of occurrences of each attribute in the partition and the corresponding attribute weights; obtaining a partition quality label for each time-sequential uniform partition according to that label and the attribute volatility, and obtaining a time-series data quality curve of the whole data warehouse according to the partition quality labels;
S6: obtaining a big data quality parameter according to the time-series data quality curve;
S7: selecting the corresponding optimal data warehouse data according to the big data quality parameters to re-optimize the digital twin system and establish a digital twin model.
Preferably, the step of obtaining the data-volume difference corresponding to each time node and obtaining the time-sequential uniform partitions according to the data-volume difference comprises:

with one day as a time node, the data-volume difference of the nodes is calculated as

$$D_1=\frac{1}{H_1-1}\sum_{h_1=2}^{H_1}\left|X_{h_1}-X_{h_1-1}\right|$$

where $h_1$ denotes the $h_1$-th time node when one day is taken as a time node, $H_1$ is the total number of time nodes when one day is taken as a time node, $X_{h_1}$ denotes the amount of data in the data warehouse in the $h_1$-th time node, and $X_{h_1-1}$ denotes the amount of data in the data warehouse in the $(h_1-1)$-th time node;

the data-volume difference $D_2$ with two days as one time node is obtained in the same way, and so on for every $T\in[1,\Delta T]$, where $\Delta T$ is the largest candidate time-node length (at most $\lfloor \mathrm{MAXT}/2\rfloor$, so that at least two time nodes exist) and MAXT is the number of days over which the data in the industrial data warehouse exist;

when the data-volume difference $D_T$ corresponding to T days as a time node is the minimum, the data of the whole data warehouse are uniformly partitioned with every T days as a time node.
Preferably, the step of obtaining the data attributes in the data warehouse comprises:

completing the extraction of the attributes of the data related to the digital twin in the data warehouse by using a named entity recognition technique, and obtaining the set A of relevant attributes related to the digital twin as

$$A=\{A_1,A_2,\dots,A_b,\dots,A_B\}$$

where B represents the total number of attributes related to the digital twin extracted from the data warehouse and $A_b$ represents the b-th attribute.
Preferably, the step of calculating the attribute weights of all attributes comprises:

for the b-th attribute $A_b$, its attribute weight $w_b$ is the ratio of the number of occurrences of attribute $A_b$ in the data warehouse to the total number of occurrences of all attributes related to the digital twin extracted from the data warehouse;

the attribute weights of all attributes are calculated in this way.
Preferably, the step of obtaining the attribute volatility of each day relative to the previous day according to the number of occurrences of each attribute in each time-sequential uniform partition and the attribute weight corresponding to each attribute comprises:

the attribute volatility of the t-th day in the h-th time-sequential uniform partition relative to the previous day is calculated as

$$\Delta A_t=\exp\!\left(-\left(\sum_{b'_t} w_{b'_t}\, n^{\,t}_{b'_t}+\sum_{b''_t} w_{b''_t}\, n^{\,t-1}_{b''_t}\right)\right)$$

where $\Delta A_t$ represents the attribute volatility of the t-th day in the h-th time-sequential uniform partition relative to the previous day; $b'_t$ indexes the attributes that appear on the t-th day but not on the (t-1)-th day of the h-th time-sequential uniform partition, $n^{\,t}_{b'_t}$ being the number of occurrences of such an attribute on the t-th day; $b''_t$ indexes the attributes that do not appear on the t-th day but appear on the (t-1)-th day of the h-th time-sequential uniform partition, $n^{\,t-1}_{b''_t}$ being the number of occurrences of such an attribute on the (t-1)-th day; $w_b$ is the attribute weight of the b-th attribute $A_b$; and $\exp()$ represents the exponential function with the natural constant as its base.
Preferably, the step of obtaining the partition quality label of each time-sequential uniform partition comprises:

for the h-th time-sequential uniform partition, the partition quality label is calculated as

$$C_h=\sum_{t=1}^{T}\left(\Delta A_t\cdot\sum_{b=1}^{B} n^{\,t}_{b}\,w_b\right)$$

where $C_h$ is the partition quality label of the h-th time-sequential uniform partition; t denotes the t-th day in the h-th time-sequential uniform partition and T is the total number of days in that partition; $\Delta A_t$ is the attribute volatility of the t-th day relative to the previous day; $n^{\,t}_{b}$ is the number of occurrences of the b-th attribute on the t-th day of the h-th time-sequential uniform partition; and $w_b$ is the attribute weight corresponding to the b-th attribute;

the partition quality labels of all time-sequential uniform partitions are calculated in this way.
Preferably, the time-series data quality curve of the whole data warehouse data is obtained from a monotonically increasing model constructed over the time-sequential uniform partitions and their partition quality labels.
Preferably, the big data quality parameter is obtained by integrating the time-series data quality curve.
Preferably, the step of selecting the corresponding optimal data warehouse data according to the big data quality parameter to re-optimize the digital twin system and establish the digital twin model comprises:

calculating the big data quality parameter of every data warehouse involved in the full life cycle generated by the digital twin system, and selecting the data warehouse data with the largest quality parameter to re-optimize the digital twin system and establish the digital twin model.
The embodiments of the invention have the following beneficial effects:
1. Compared with the prior art, the quality of the whole data set is measured not from all of the data but from the characteristic attributes of the data, so that the quality of the big data can be measured with a greatly reduced amount of computation.
2. Compared with the prior art, the data measurement can be carried out using the relevant data generated during the operation of the digital twin system, which gives higher accuracy than existing measurement approaches. Moreover, because the method calculates over time-sequential partitions of the data, the influence of time on the data quality measurement is greatly weakened, thereby providing high-quality data support for the subsequent optimization of the digital twin system.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a flowchart illustrating steps of a method for big data based data quality measurement according to an embodiment of the present invention.
Detailed Description
To further illustrate the technical means adopted by the present invention to achieve the intended purposes and their effects, a detailed description of a data quality measurement method based on big data, its structure, features and effects, is given below with reference to the accompanying drawings and the preferred embodiments. In the following description, different references to "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following describes a specific scheme of the big data based data quality measurement method provided by the present invention in detail with reference to the accompanying drawings.
Referring to fig. 1, a flowchart illustrating steps of a big data based data quality measurement method according to an embodiment of the present invention is shown, where the method includes the following steps:
S001: acquiring an industrial data warehouse related to the digital twin and preprocessing the data in the data warehouse.
Since the quality of the big data must be measured from the data in the data warehouse, the industrial data warehouse is populated with the digital twin data generated by the digital twin system, and the stored data are structured. The data structure required in this embodiment is: statistical time + data warehouse entry. For example, after the log information of the digital twin is imported into the data warehouse, the generation time of a log record is its statistical time, and the record stored in the data warehouse is the data warehouse entry. The log information of the digital twin comprises the log data generated by the digital twin system over the full life cycle of industrial product design, process planning, production, operation and maintenance; the digital twin data may be data generated in any link of this full life cycle, for example the data generated in the industrial product design link. All big data in the acquired industrial data warehouses are processed into the above structure so that their quality can be measured in the following steps; this embodiment takes a single data warehouse as the object of study.
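For illustration only, a minimal sketch of this structuring step is given below, assuming Python and a simple log layout; the class and field names (WarehouseEntry, stat_date, entry, timestamp, text) are assumptions made for the sketch and are not terms of the disclosure.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class WarehouseEntry:
    """One structured record: statistical time + data warehouse entry."""
    stat_date: date   # generation time of the log record (the statistical time)
    entry: str        # the stored warehouse entry itself (e.g. one log line)

def preprocess(raw_logs: List[dict]) -> List[WarehouseEntry]:
    """Turn raw digital-twin log records into (statistical time, entry) pairs."""
    records = []
    for log in raw_logs:
        # raw_logs items are assumed to look like {"timestamp": "2022-11-28T09:30:00", "text": "..."}
        records.append(WarehouseEntry(
            stat_date=date.fromisoformat(log["timestamp"][:10]),
            entry=log["text"].strip(),
        ))
    # sort by statistical time so that the later day-based partitioning is straightforward
    return sorted(records, key=lambda r: r.stat_date)
```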
S002: acquiring the data-volume differences corresponding to different time nodes, and partitioning the big data into time-sequential uniform partitions according to the data-volume differences.
The big data in the industrial data warehouse are first partitioned into time-sequential uniform partitions, based on the time sequence and on equal data volumes, in order to reduce the influence of data volume on the big data in the industrial data warehouse.

The data volume is an important factor in the quality measurement of big data: data volumes of different sizes strongly influence the measurement, in that the larger the data volume, the larger the possible variation of the big data quality, and the data volumes at different consecutive points in time differ, which strongly influences the quality measurement. The invention therefore partitions the big data of the whole industrial data warehouse into uniformly distributed time nodes, so that the data volume in each interval is approximately equal and the time span of each interval is equal, which subsequently reduces the influence of unequal data volumes on the data quality measurement. The data in the data warehouse are counted in units of days, so the time nodes of the time sequence take the day as the basic unit, and the data in the data warehouse are evenly distributed over partitions of such time nodes. The specific process is as follows:
First, with one day as a time node, the data-volume difference $D_1$ of the nodes is calculated:

$$D_1=\frac{1}{H_1-1}\sum_{h_1=2}^{H_1}\left|X_{h_1}-X_{h_1-1}\right|$$

where $h_1$ denotes the $h_1$-th time node when one day is taken as a time node, $H_1$ is the total number of time nodes obtained when one day is taken as a time node, $X_{h_1}$ denotes the amount of data in the data warehouse in the $h_1$-th time node, and $X_{h_1-1}$ denotes the amount of data in the data warehouse in the $(h_1-1)$-th time node.
The larger $D_1$ is, the larger the difference between the daily data volumes is, and the greater the influence on the subsequent data quality measurement when one day is used as the time node. This is because when the data volume is too large, the diversity of the attributes is stronger and the frequency of occurrence of a single attribute is higher, whereas when the data volume is too small, the diversity of the attributes is weaker and the frequency of occurrence of a single attribute is lower. If the daily data volumes differ greatly, the occurrence frequencies of the different attributes differ as well, and the subsequent quality measurement of each interval becomes inaccurate; and vice versa.
Then, with two days as a time node, the data-volume difference $D_2$ of the nodes is calculated:

$$D_2=\frac{1}{H_2-1}\sum_{h_2=2}^{H_2}\left|X_{h_2}-X_{h_2-1}\right|$$

where $h_2$ denotes the $h_2$-th time node when two days are taken as a time node, $H_2$ is the maximum number of time nodes when two days are taken as a time node, $X_{h_2}$ denotes the amount of data in the data warehouse in the $h_2$-th time node, and $X_{h_2-1}$ denotes the amount of data in the data warehouse in the $(h_2-1)$-th time node.

The larger $D_2$ is, the larger the difference between the data volumes is, and the greater the influence on the subsequent data quality measurement when two days are used as the time node, and vice versa.
In the same way, $D_T$ is calculated for every candidate node length T, up to $D_{\Delta T}$, where $\Delta T$ is the largest candidate time-node length (at most $\lfloor \mathrm{MAXT}/2\rfloor$, so that at least two time nodes exist, MAXT being the number of days over which the data in the industrial data warehouse exist). The set $\{D_1, D_2, \dots, D_T, \dots, D_{\Delta T}\}$ is obtained, and the node length $T\in[1,\Delta T]$ for which $D_T$ is minimal is selected.

MAXT is then divided continuously and equally into H time-sequential uniform partitions in units of T days. These are called time-sequential uniform partitions because all data are partitioned with the same number of days T as the time node, and partitioning by such time nodes makes the amount of data in each partition substantially equal.
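A minimal sketch of this node-length selection follows, again assuming Python; the mean-absolute-difference form of $D_T$, the cap of $\lfloor \mathrm{MAXT}/2\rfloor$ on the candidate length and all identifier names are assumptions made for illustration where the published formulas are not legible.

```python
from typing import List

def volume_difference(node_volumes: List[int]) -> float:
    """Mean absolute difference of data volume between adjacent time nodes (the D_T of the text)."""
    if len(node_volumes) < 2:
        return float("inf")  # fewer than two nodes: no adjacent pairs to compare
    return sum(abs(a - b) for a, b in zip(node_volumes[1:], node_volumes[:-1])) / (len(node_volumes) - 1)

def choose_node_length(per_day_volume: List[int]) -> int:
    """Pick the node length T (in days) whose adjacent node volumes differ the least."""
    maxt = len(per_day_volume)                   # MAXT: number of days covered by the warehouse
    best_t, best_d = 1, float("inf")
    for t in range(1, maxt // 2 + 1):            # assumed cap so that at least two nodes exist
        # aggregate daily volumes into nodes of t days, dropping an incomplete trailing node
        nodes = [sum(per_day_volume[i:i + t]) for i in range(0, maxt - maxt % t, t)]
        d_t = volume_difference(nodes)
        if d_t < best_d:
            best_t, best_d = t, d_t
    return best_t

# usage sketch: per_day_volume[i] = number of warehouse entries generated on day i;
# T = choose_node_length(per_day_volume); the warehouse is then split into H = MAXT // T
# consecutive partitions of T days each (the "time-sequential uniform partitions").
```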
S003: acquiring the data attributes in the data warehouse, calculating the attribute weights of all attributes, obtaining the partition quality label of each time-sequential uniform partition according to the attribute weights, and obtaining the time-series data quality curve of the whole data warehouse data according to the partition quality labels.
The quality of big data is relative. Therefore, the attributes of the digital twin data are first extracted and their attribute weights are calculated; the stability of the attributes within each time-sequential uniform partition is then computed from the attribute weights and the occurrence counts of the attributes; the quality label of the big data in each time-sequential uniform partition is obtained from this attribute stability, the occurrence counts of the different attributes and their corresponding attribute weights; the quality curve of the big data in the industrial data warehouse is then obtained from the quality labels of all time-sequential uniform partitions; and finally the quality of the big data is measured from this quality curve. The specific process is as follows:
(1) Data attributes in a data warehouse are obtained.
The data attribute extraction process extracts keywords from the warehouse data, i.e. from the log information related to the digital twin. The specific process is as follows: the relevant log information related to the digital twin data is labeled manually, and the extraction of the data attributes related to the digital twin in the data warehouse is then completed with a named entity recognition technique. The set of relevant attributes related to the digital twin obtained in this way is

$$A=\{A_1,A_2,\dots,A_b,\dots,A_B\}$$

where B represents the total number of attributes of the digital twin extracted from the data warehouse and $A_b$ represents the b-th attribute.
(2) The attribute weights of all attributes are calculated.
The objectivity and accuracy with which the extracted digital twin attributes describe the physical entity differ from attribute to attribute, and so does their contribution to the quality of the big data in the data warehouse; therefore the objectivity and accuracy of the data with different attributes need to be quantified. Taking the b-th attribute $A_b$ as an example, its attribute weight $w_b$ is calculated as

$$w_b=\frac{n_b}{\sum_{b=1}^{B} n_b}$$

where $n_b$ is the number of times the b-th attribute $A_b$ occurs in the data warehouse and $\sum_{b=1}^{B} n_b$ is the total number of occurrences of all attributes.

The significance of quantifying the attribute weight in this way is that the more often an attribute occurs, the more objective and accurate it is shown to be; that is, the larger $w_b$ is, the more objectively and accurately the twin attribute $A_b$ describes the relevant physical entity in the data warehouse, the higher its importance, and the larger its weight value.

The attribute weights of all extracted attributes are calculated in the above way, giving the B attribute weights corresponding to the B attributes.
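The weighting step can be sketched as follows, assuming Python; extract_attributes and ATTRIBUTE_VOCABULARY are hypothetical stand-ins for the named entity recognition model and its label set, not components defined by the disclosure.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Toy stand-in for the named entity recognition step: in practice a trained NER model
# would tag the digital-twin-related attributes; here a small assumed vocabulary is
# matched so that the weighting logic below can be exercised.
ATTRIBUTE_VOCABULARY = {"temperature", "vibration", "torque", "surface defect"}

def extract_attributes(entry: str) -> List[str]:
    """Return the attribute names found in one warehouse entry (toy matcher)."""
    text = entry.lower()
    return [attr for attr in ATTRIBUTE_VOCABULARY if attr in text]

def attribute_weights(entries: Iterable[str]) -> Dict[str, float]:
    """w_b = occurrences of attribute b / total occurrences of all extracted attributes."""
    counts = Counter()
    for entry in entries:
        counts.update(extract_attributes(entry))
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()} if total else {}
```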
(3) The partition quality label of each time-sequential uniform partition is obtained according to the attribute weights.
After the attribute weights of all attributes have been obtained, the partition quality label of each time-sequential uniform partition is quantified from the attributes of all big data in that partition, because each interval of the time-sequential uniform partitioning contains big data counted over different days, the attributes of each data item on each day are not necessarily the same, and different attributes have different weights. Taking the h-th time-sequential uniform partition as an example, its partition quality label is calculated as

$$C_h=\sum_{t=1}^{T}\left(\Delta A_t\cdot\sum_{b=1}^{B} n^{\,t}_{b}\,w_b\right)$$

where $C_h$ is the partition quality label of the h-th time-sequential uniform partition; t denotes the t-th day in the h-th time-sequential uniform partition, with $t\in[1,T]$, and T is the total number of days in the partition; $\Delta A_t$ is the attribute volatility of the t-th day relative to the previous day; $n^{\,t}_{b}$ is the number of occurrences of the b-th attribute on the t-th day of the h-th time-sequential uniform partition; and $w_b$ is the attribute weight corresponding to the b-th attribute.
The attribute volatility $\Delta A_t$ is calculated as

$$\Delta A_t=\exp\!\left(-\left(\sum_{b'_t} w_{b'_t}\, n^{\,t}_{b'_t}+\sum_{b''_t} w_{b''_t}\, n^{\,t-1}_{b''_t}\right)\right)$$

where $\Delta A_t$ represents the attribute volatility of the t-th day in the h-th time-sequential uniform partition relative to the previous day; $b'_t$ indexes the attributes that appear on the t-th day but not on the (t-1)-th day of the h-th time-sequential uniform partition, $n^{\,t}_{b'_t}$ being the number of occurrences of such an attribute on the t-th day; $b''_t$ indexes the attributes that do not appear on the t-th day but appear on the (t-1)-th day of the h-th time-sequential uniform partition, $n^{\,t-1}_{b''_t}$ being the number of occurrences of such an attribute on the (t-1)-th day; and $w_b$ is the attribute weight of the b-th attribute $A_b$.
$\Delta A_t$ is calculated from the fluctuation of the attributes of the data within the h-th time-sequential uniform partition. The specific quantification uses the attributes that differ between consecutive days, together with their attribute weights and occurrence counts: the larger the difference between the attributes appearing on consecutive days, the stronger the fluctuation of the attributes of the data, and the stronger the fluctuation, the poorer the quality of the data carrying those attributes on that day. The negative exponential of e then inverts this quantity, so that when the fluctuation of the data is larger, the resulting stability of the attributes is smaller, and vice versa. Next, the occurrence counts of the attributes appearing in the time-sequential uniform partition, multiplied by their corresponding weights, serve as the label of the data in the current partition: the more attributes occur in the partition and the larger their weights, the larger $\sum_b n^{\,t}_{b}\,w_b$ and the better the quality of the data in the partition. The product of the two parts is therefore used as the partition label $C_h$: the larger $C_h$, the better the quality of the data in the partition, and vice versa.
The partition quality labels of all H time-sequential uniform partitions are calculated in the above way. The larger a partition quality label is, the more attributes with large attribute weights the big data of the different days in that partition contain, and the more stably those attributes occur across the days, so the better, relatively, is the quality of the big data in that time-sequential uniform partition. In this way the partition quality labels of all time-sequential uniform partitions are obtained.
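A sketch of the per-partition label computation under the formulas reconstructed above follows, assuming Python; the handling of the first day of a partition (which has no previous day) and the identifier names are illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List

def day_attribute_counts(day_attrs: List[str]) -> Counter:
    """Occurrence count n_b^t of each attribute on one day."""
    return Counter(day_attrs)

def attribute_volatility(today: Counter, yesterday: Counter, w: Dict[str, float]) -> float:
    """Delta A_t: exp(-(weighted count of attributes that changed between the two days))."""
    appeared = set(today) - set(yesterday)      # on day t but not on day t-1
    vanished = set(yesterday) - set(today)      # on day t-1 but not on day t
    change = sum(w.get(a, 0.0) * today[a] for a in appeared) \
           + sum(w.get(a, 0.0) * yesterday[a] for a in vanished)
    return math.exp(-change)

def partition_quality_label(days: List[List[str]], w: Dict[str, float]) -> float:
    """C_h: sum over days of Delta A_t times the weighted attribute count of that day."""
    label = 0.0
    prev = Counter()
    for t, attrs in enumerate(days):            # days[t] = attributes extracted on day t
        today = day_attribute_counts(attrs)
        # assumption: the first day of a partition has no previous day, so its volatility factor is 1
        delta = attribute_volatility(today, prev, w) if t > 0 else 1.0
        label += delta * sum(w.get(a, 0.0) * n for a, n in today.items())
        prev = today
    return label
```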
(4) The time-series data quality curve of the whole data warehouse data is obtained according to the partition quality labels.
After the partition quality labels of all time-sequential uniform partitions have been obtained, the time-series data quality curve of the whole data warehouse data is calculated from them, the specific formula being

$$f(h)=\left(1-e^{-h}\right)C_h$$

where f(h) is the time-series data quality curve of the whole data warehouse data; $C_h$ is the partition quality label of the h-th time-sequential uniform partition; h is the index of the time-sequential uniform partition, with $h\in[1,H]$; H is the total number of time-sequential uniform partitions; and e is the natural constant.

f(h) represents the time-series data quality curve of the whole data warehouse data: it is obtained by fitting, over all time-sequential uniform partitions, the partition quality labels $C_h$ of the data in each partition weighted by a time-based benefit. The reasoning is that the longer ago the big data were entered (the smaller the partition index h), the lower the benefit of the assistance the data provide to the digital twin; for example, data about the surface defects of a screw from ten years ago have a low degree of influence on the design of a screw in the same field today, so the benefit obtained from such data is low.

Big data from different times are of different importance to the digital twin for each data item in the industrial data warehouse (i.e. the timeliness of the big data): the further the data lie from the time at which they are used, the weaker their effect and the lower their benefit, although they do not become completely invalid. The invention therefore constrains this with a function that is monotonically increasing and bounded over infinite time, namely $1-e^{-h}$, so that the closer the generation time of the data is to the measurement time, the higher the benefit, and the further away it is, the lower the benefit. The product of this function with the partition quality label represents the quality curve of the big data at different times.
The data quality curves of the digital twin with respect to all data warehouses are obtained using the above-described approach.
S004: obtaining the big data quality parameter according to the time-series data quality curve.
After the quality curve f(h) of all big data in the industrial data warehouse has been obtained, it is used to measure the quality of the big data. Specifically, the big data quality parameter D is calculated from the quality curve as

$$D=\frac{1}{H}\int_{0}^{H} f(h)\,\mathrm{d}h$$

where D is the big data quality parameter of the data warehouse data; h is the index of the time-sequential uniform partition; f(h) is the time-series data quality curve of the whole data warehouse data; $\mathrm{d}h$ is the infinitesimal element; and H is the total number of time-sequential uniform partitions.

The logic of the formula is as follows: the benefit curve of the big data in the industrial data warehouse cannot by itself measure the quality of the big data clearly and intuitively, and the statistical time spans of different data warehouses differ, which would affect the quantified quality; therefore the quality of the big data is measured by integrating the quality curve and then taking the average, which, from a more intuitive standpoint, makes the influence of the statistical time span on the big data quality smaller. The larger the quality parameter D, the better the quality of the big data in the data warehouse, and vice versa.
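The last two steps can be sketched numerically as follows, assuming Python; the $1-e^{-h}$ timeliness factor and the trapezoidal approximation of the integral are assumptions standing in for the figures that are not legible in the published text.

```python
import math
from typing import List

def quality_curve(labels: List[float]) -> List[float]:
    """f(h) = (1 - e^{-h}) * C_h for h = 1..H (assumed form of the timeliness factor)."""
    return [(1.0 - math.exp(-h)) * c for h, c in enumerate(labels, start=1)]

def quality_parameter(labels: List[float]) -> float:
    """D: average of the quality curve, approximating (1/H) * integral of f(h) dh
    with the trapezoidal rule over the partition indices h = 1..H."""
    f = quality_curve(labels)
    if len(f) < 2:
        return f[0] if f else 0.0
    area = sum((a + b) / 2.0 for a, b in zip(f[:-1], f[1:]))
    return area / (len(f) - 1)

# usage sketch: compute D for every candidate data warehouse and pick the warehouse
# with the largest D to re-optimize the digital twin system and rebuild the model.
```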
S005: re-optimizing the digital twin system and establishing the digital twin model according to the data warehouse data with the largest big data quality parameter.
Within the full life cycle of the digital twin system, after the big data quality parameters of all data warehouse data have been obtained, the data warehouse data with the largest big data quality parameter are selected to re-optimize the digital twin system and to establish the digital twin model. The re-optimization of the digital twin system may consist of iteratively upgrading the modeling of the digital twin with the data warehouse data of largest quality parameter, or of training and optimizing the industry mechanism models in the digital twin system; the optimization process itself is not the focus of protection of the invention and is not described in detail.
In conclusion, the quality of the big data is measured with the characteristic attributes in the data, so that the quality measurement can be carried out with a greatly reduced amount of computation, providing high-quality data support for the optimization of the digital twin system. Because the method computes over time-sequential partitions of the data, the influence of time on the data quality measurement is greatly weakened and the accuracy is higher.
It should be noted that the order of the above embodiments of the present invention is given only for description and does not represent the relative merit of the embodiments. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or a sequential order, to achieve desirable results. In some embodiments multitasking and parallel processing are also possible and may be advantageous.
The embodiments in the present specification are described in a progressive manner; the same and similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; the modifications or substitutions do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present application, and are included in the protection scope of the present application.

Claims (9)

1. A data quality measurement method based on big data, the method comprising:
S1: acquiring digital twin data in an industrial data warehouse;
S2: taking different time intervals as unit time nodes, obtaining the data-volume difference corresponding to each candidate time node from the data-volume differences of adjacent times, and obtaining time-sequential uniform partitions according to the data-volume difference;
S3: acquiring the data attributes in the data warehouse and calculating the attribute weights of all attributes;
S4: obtaining the attribute volatility of each day relative to the previous day according to the number of occurrences of each attribute in each time-sequential uniform partition and the attribute weight corresponding to each attribute;
S5: obtaining a label for the data in each time-sequential uniform partition according to the number of occurrences of each attribute in the partition and the corresponding attribute weights; obtaining a partition quality label for each time-sequential uniform partition according to that label and the attribute volatility, and obtaining a time-series data quality curve of the whole data warehouse according to the partition quality labels;
S6: obtaining a big data quality parameter according to the time-series data quality curve;
S7: selecting the corresponding optimal data warehouse data according to the big data quality parameters to re-optimize the digital twin system and establish a digital twin model.
2. The big-data-based data quality measurement method according to claim 1, wherein the step of obtaining the data-volume difference corresponding to each time node comprises:

with one day as a time node, the data-volume difference of the nodes is calculated as

$$D_1=\frac{1}{H_1-1}\sum_{h_1=2}^{H_1}\left|X_{h_1}-X_{h_1-1}\right|$$

where $h_1$ denotes the $h_1$-th time node when one day is taken as a time node, $H_1$ is the total number of time nodes when one day is taken as a time node, $X_{h_1}$ denotes the amount of data in the data warehouse in the $h_1$-th time node, and $X_{h_1-1}$ denotes the amount of data in the data warehouse in the $(h_1-1)$-th time node;

the data-volume difference $D_2$ with two days as one time node is obtained in the same way, and so on for every $T\in[1,\Delta T]$, where $\Delta T$ is the largest candidate time-node length (at most $\lfloor \mathrm{MAXT}/2\rfloor$, so that at least two time nodes exist) and MAXT is the number of days over which the data in the industrial data warehouse exist;

when the data-volume difference $D_T$ corresponding to T days as a time node is the minimum, the data of the whole data warehouse are uniformly partitioned with every T days as a time node.
3. The big-data-based data quality measurement method according to claim 1, wherein the step of obtaining the data attributes in the data warehouse comprises:

completing the extraction of the attributes of the data related to the digital twin in the data warehouse by using a named entity recognition technique, and obtaining the set A of relevant attributes related to the digital twin as

$$A=\{A_1,A_2,\dots,A_b,\dots,A_B\}$$

where B represents the total number of attributes related to the digital twin extracted from the data warehouse and $A_b$ represents the b-th attribute.
4. The big-data-based data quality measurement method according to claim 3, wherein the step of calculating the attribute weights of all attributes comprises:

for the b-th attribute $A_b$, its attribute weight $w_b$ is the ratio of the number of occurrences of attribute $A_b$ in the data warehouse to the total number of occurrences of all attributes related to the digital twin extracted from the data warehouse;

the attribute weights of all attributes are calculated in this way.
5. The big-data-based data quality measurement method according to claim 1, wherein the step of obtaining the attribute volatility of each day relative to the previous day according to the number of occurrences of each attribute in each time-sequential uniform partition and the attribute weight corresponding to each attribute comprises:

the attribute volatility of the t-th day in the h-th time-sequential uniform partition relative to the previous day is calculated as

$$\Delta A_t=\exp\!\left(-\left(\sum_{b'_t} w_{b'_t}\, n^{\,t}_{b'_t}+\sum_{b''_t} w_{b''_t}\, n^{\,t-1}_{b''_t}\right)\right)$$

where $\Delta A_t$ represents the attribute volatility of the t-th day in the h-th time-sequential uniform partition relative to the previous day; $b'_t$ indexes the attributes that appear on the t-th day but not on the (t-1)-th day of the h-th time-sequential uniform partition, $n^{\,t}_{b'_t}$ being the number of occurrences of such an attribute on the t-th day; $b''_t$ indexes the attributes that do not appear on the t-th day but appear on the (t-1)-th day of the h-th time-sequential uniform partition, $n^{\,t-1}_{b''_t}$ being the number of occurrences of such an attribute on the (t-1)-th day; $w_b$ is the attribute weight of the b-th attribute $A_b$; and $\exp()$ represents the exponential function with the natural constant as its base.
6. The big-data-based data quality measurement method according to claim 1, wherein the step of obtaining the partition quality label of each time-sequential uniform partition comprises:

for the h-th time-sequential uniform partition, the partition quality label is calculated as

$$C_h=\sum_{t=1}^{T}\left(\Delta A_t\cdot\sum_{b=1}^{B} n^{\,t}_{b}\,w_b\right)$$

where $C_h$ is the partition quality label of the h-th time-sequential uniform partition; t denotes the t-th day in the h-th time-sequential uniform partition and T is the total number of days in that partition; $\Delta A_t$ is the attribute volatility of the t-th day relative to the previous day; $n^{\,t}_{b}$ is the number of occurrences of the b-th attribute on the t-th day of the h-th time-sequential uniform partition; and $w_b$ is the attribute weight corresponding to the b-th attribute;

the partition quality labels of all time-sequential uniform partitions are calculated in this way.
7. The big-data-based data quality measurement method according to claim 1, wherein the time-series data quality curve of the whole data warehouse data is obtained from a monotonically increasing model constructed over the time-sequential uniform partitions and their partition quality labels.
8. The big-data-based data quality measurement method according to claim 1, wherein the big data quality parameter is obtained by integrating the time-series data quality curve.
9. The big-data-based data quality measurement method according to claim 1, wherein the step of selecting the corresponding optimal data warehouse data according to the big data quality parameter to re-optimize the digital twin system and establish the digital twin model comprises:

calculating the big data quality parameter of every data warehouse involved in the full life cycle generated by the digital twin system, and selecting the data warehouse data with the largest quality parameter to re-optimize the digital twin system and establish the digital twin model.
CN202211499047.8A, filed 2022-11-28, priority 2022-11-28: Data quality measurement method based on big data (Active; granted as CN115718744B)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211499047.8A (granted as CN115718744B) | 2022-11-28 | 2022-11-28 | Data quality measurement method based on big data

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211499047.8A (granted as CN115718744B) | 2022-11-28 | 2022-11-28 | Data quality measurement method based on big data

Publications (2)

Publication Number | Publication Date
CN115718744A | 2023-02-28
CN115718744B | 2023-07-21

Family

ID=85256651

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202211499047.8A (Active; granted as CN115718744B) | Data quality measurement method based on big data | 2022-11-28 | 2022-11-28

Country Status (1)

CN: CN115718744B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111562769A (en) * 2019-02-14 2020-08-21 罗克韦尔自动化技术公司 AI extension and intelligent model validation for industrial digital twinning
US20200342333A1 (en) * 2017-03-23 2020-10-29 Asml Netherlands B.V. Methods of modelling systems or performing predictive maintenance of systems, such as lithographic systems and associated lithographic systems
WO2020227429A1 (en) * 2019-05-06 2020-11-12 Strong Force Iot Portfolio 2016, Llc Platform for facilitating development of intelligence in an industrial internet of things system
CN112367109A (en) * 2020-09-28 2021-02-12 西北工业大学 Incentive method for digital twin-driven federal learning in air-ground network
CN113742431A (en) * 2021-08-13 2021-12-03 太原向明智控科技有限公司 Method and system for managing working surface measurement data
US20220036301A1 (en) * 2019-11-05 2022-02-03 Strong Force Vcn Portfolio 2019, Llc Internet of things resources for control tower and enterprise management platform
US20220059238A1 (en) * 2020-08-24 2022-02-24 GE Precision Healthcare LLC Systems and methods for generating data quality indices for patients
CN114528284A (en) * 2022-02-18 2022-05-24 广东电网有限责任公司 Bottom layer data cleaning method and device, mobile terminal and storage medium
CN114548509A (en) * 2022-01-18 2022-05-27 湖南大学 Multi-type load joint prediction method and system for multi-energy system
US20220207206A1 (en) * 2020-12-24 2022-06-30 Beijing Institute Of Technology Physical Digital Twin Modeling Method And Apparatus For Assembly, Electronic Device And Medium
CN114691654A (en) * 2020-12-28 2022-07-01 丰田自动车株式会社 Data processing method and data processing system in energy Internet
CN114968984A (en) * 2022-06-11 2022-08-30 上海起杭数字科技有限公司 Digital twin full life cycle management platform

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HARI SHANKAR GOVINDASAMY et al.: "Air Quality Management: An Exemplar for Model-Driven Digital Twin Engineering", 2021 ACM/IEEE International Conference on Model Driven Engineering Languages and Systems Companion (MODELS-C), pages 229-232
RAFAEL TEIXEIRA et al.: "time sequence AND digital twin AND measure AND data quality", SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, pages 191-197
孟冠军 et al.: "Quality prediction model of the product assembly process based on twin data" (基于孪生数据的产品装配过程质量预测模型), 组合机床与自动化加工技术, pages 126-129
李立雪 et al.: "Dynamic data modeling of digital twin models based on a time-series database" (基于时序数据库的数字孪生模型动态数据建模), 2022年中国航空工业技术装备工程协会年会论文集, pages 274-278

Also Published As

Publication Number | Publication Date
CN115718744B (en) | 2023-07-21

Similar Documents

Publication Publication Date Title
JP2017126158A (en) Binary classification learning device, binary classification device, method, and program
CN112529638B (en) Service demand dynamic prediction method and system based on user classification and deep learning
CN113111924A (en) Electric power customer classification method and device
CN113807900A (en) RF order demand prediction method based on Bayesian optimization
CN116523320A (en) Intellectual property risk intelligent analysis method based on Internet big data
CN112289391A (en) Anode aluminum foil performance prediction system based on machine learning
CN111209469A (en) Personalized recommendation method and device, computer equipment and storage medium
CN115983622A (en) Risk early warning method of internal control cooperative management system
Han et al. Online inference with debiased stochastic gradient descent
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN115718744A (en) Data quality measurement method based on big data
CN113203953A (en) Lithium battery residual service life prediction method based on improved extreme learning machine
CN114820074A (en) Target user group prediction model construction method based on machine learning
CN114757495A (en) Membership value quantitative evaluation method based on logistic regression
CN112465054A (en) Multivariate time series data classification method based on FCN
CN112348275A (en) Regional ecological environment change prediction method based on online incremental learning
Bystrov et al. Choosing the Number of Topics in LDA Models--A Monte Carlo Comparison of Selection Criteria
CN114117251B (en) Intelligent context-Bo-down fusion multi-factor matrix decomposition personalized recommendation method
CN113837266B (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
CN117792404B (en) Data management method for aluminum alloy die-casting part
CN111027021A (en) Multi-dimensional and multi-weight price prediction method
Yousuf et al. Digital Data Forgetting: A Machine Learning Approach
CN117216524A (en) Experimental data analysis and autonomous training method based on artificial intelligence
CN114692886A (en) Model training method and system, and storage medium
CN117633309A (en) Data processing method, device, computer equipment and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant