CN115718744B - Data quality measurement method based on big data - Google Patents

Data quality measurement method based on big data

Info

Publication number
CN115718744B
CN115718744B (application CN202211499047.8A)
Authority
CN
China
Prior art keywords
data
attribute
time
quality
partition
Prior art date
Legal status
Active
Application number
CN202211499047.8A
Other languages
Chinese (zh)
Other versions
CN115718744A (en)
Inventor
杨道平
胡礼波
Current Assignee
Beijing Zhonghang Lutong Technology Co ltd
Original Assignee
Beijing Zhonghang Lutong Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zhonghang Lutong Technology Co ltd
Priority to CN202211499047.8A
Publication of CN115718744A
Application granted
Publication of CN115718744B
Status: Active

Classifications

    • Y — General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02 — Technologies or applications for mitigation or adaptation against climate change
    • Y02P — Climate change mitigation technologies in the production or processing of goods
    • Y02P 90/00 — Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P 90/30 — Computing systems specially adapted for manufacturing

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of digital twinning, and in particular to a data quality measurement method based on big data, comprising the following steps: acquiring twin data from an industrial data warehouse and preprocessing it; acquiring the data-size difference corresponding to each candidate time node and uniformly partitioning the twin big data in time sequence according to that difference; acquiring the data attributes in the industrial data warehouse, calculating the attribute weight of every attribute, obtaining a partition quality label for each time-sequence uniform partition from the attribute weights, and deriving a time-sequence data quality curve of the whole data warehouse from the partition quality labels; obtaining a big-data quality parameter from the time-sequence data quality curve; and finally, according to the big-data quality parameters, selecting the data warehouse data from different fields that is optimal over the full life cycle of the digital twin and using it to optimize the digital twin system. Compared with existing data quality measurement methods, the method improves the accuracy of twin big-data quality measurement.

Description

Data quality measurement method based on big data
Technical Field
The invention relates to the field of digital twinning, in particular to a data quality measurement method based on big data.
Background
With the development of technology the information age has arrived, and digital twin technology, driven forward by it, is now widely applied across industries in the industrial field. Digital twin technology is of great importance to industrial manufacturing: by building a digital twin model of the corresponding industrial physical entity, the processes and behavior of that entity in the physical field can be comprehensively described, mapped, monitored, diagnosed and optimized. Building such a model requires the support of multi-dimensional big data. The quality of this big data is critical to the model: when the big data of some dimension is of poor quality, the digital twin model built from it tends to be inaccurate, introducing deviations into the description, mapping, monitoring, diagnosis and optimization of the entity's processes and behavior in that dimension of the physical field; when the big data is of good quality, the resulting model tends to be more accurate and such negative effects do not arise. High-quality big data is therefore an essential support for optimizing the digital twin system and building the digital twin model.
In the prior art, measuring the quality of big data usually means measuring its overall integrity and timeliness. This approach has certain merits for big data as a whole, but it is often insufficiently objective and accurate when a digital twin system is to be optimized through twin big data. When the digital twin system is later optimized, this lack of accuracy and objectivity in the data quality makes it easy to select an unsuitable big-data set, which strongly affects the optimization of the digital twin system and the construction of the digital twin model.
In the invention, the twin big data is first uniformly partitioned in time sequence by data quantity; digital-twin attribute data is then extracted from the big data, and the quality of the extracted attribute data is measured per partition. The quality of the whole twin big data is then derived from the per-partition qualities, and finally the twin big data is selected according to the measurement results: higher-quality twin big data is chosen for digital twin system optimization and digital twin model construction.
Disclosure of Invention
In order to solve the above problems, the present invention provides a data quality measurement method based on big data, the method comprising:
s1: acquiring digital twin data in an industrial data warehouse;
S2: taking different time intervals in turn as the unit time node, obtaining the data-size difference corresponding to each candidate node from the data-size difference between adjacent times, and obtaining the time-sequence uniform partitions according to that difference;
s3: acquiring data attributes in a data warehouse, and calculating attribute weights of all the attributes;
S4: obtaining the attribute volatility of each day relative to the previous day from the number of occurrences of each attribute in each time-sequence uniform partition and the attribute weight corresponding to each attribute;
S5: obtaining the label of the data in each time-sequence uniform partition from the number of occurrences of each attribute in that partition and the corresponding attribute weights; obtaining the partition quality label of each time-sequence uniform partition from the data labels and the attribute volatility, and obtaining the time-sequence data quality curve of the whole data warehouse from the partition quality labels;
s6: obtaining big data quality parameters according to the time sequence data quality curve;
s7: and selecting corresponding optimal data warehouse data according to the big data quality parameters to re-optimize the digital twin system and establishing a digital twin model.
Preferably, the step of obtaining the data-size difference corresponding to each time node and obtaining the time-sequence uniform partitions according to the data-size difference comprises:

taking one day as the time node, computing the data-size difference

$$D_1=\frac{1}{H_1-1}\sum_{h_1=2}^{H_1}\left|n_{h_1}-n_{h_1-1}\right|$$

where $h_1$ denotes the $h_1$-th time node when one day is one time node, $H_1$ is the total number of nodes when one day is one time node, $n_{h_1}$ is the amount of data in the data warehouse in the $h_1$-th time node, and $n_{h_1-1}$ is the amount of data in the data warehouse in the $(h_1-1)$-th time node;

computing analogously the data-size difference $D_2$ with two days as the time node, and so on, obtaining $D_T$ for every $T\in[1,\Delta t]$, where MAXT denotes the number of days spanned by all data in the industrial data warehouse;

when the data-size difference $D_T$ corresponding to $T$ days as the time node is minimal, taking every $T$ days as one time node and uniformly partitioning the data in the data warehouse accordingly.
Preferably, the step of acquiring the data attribute in the data warehouse includes:
the attribute extraction of the digital-twin-related data is completed in the data warehouse using a named-entity recognition technique, giving the set A of digital-twin-related attributes:

A = {A_1, A_2, …, A_b, …, A_B}

where B denotes the total number of digital-twin-related attributes extracted from the data warehouse and A_b denotes the b-th attribute.
Preferably, the step of calculating attribute weights of all attributes includes:
for the b-th attribute $A_b$, its attribute weight $w_b$ is the ratio of the number of occurrences of $A_b$ in the data warehouse to the total number of occurrences of all digital-twin-related attributes extracted from the data warehouse;
and calculating attribute weights of all the attributes.
Preferably, the step of obtaining the attribute volatility of each day relative to the previous day from the number of occurrences of each attribute in each time-sequence uniform partition and the corresponding attribute weights comprises:

computing, for the t-th day of the h-th time-sequence uniform partition, the attribute volatility relative to the previous day as

$$\Delta A_t=\exp\!\left(-\left(\sum_{b'=1}^{B'_t} n^{b'}_{t}\,w_{b'}+\sum_{b''=1}^{B''_t} n^{b''}_{t-1}\,w_{b''}\right)\right)$$

where $\Delta A_t$ denotes the attribute volatility of the t-th day relative to the previous day in the h-th time-sequence uniform partition; $b'$ indexes the attributes that appear on the t-th day but not on the (t-1)-th day, $B'_t$ is their number, and $n^{b'}_{t}$ is the number of occurrences of the $b'$-th such attribute; $w_{b'}$ is the attribute weight of attribute $A_{b'}$; $b''$ indexes the attributes that do not appear on the t-th day but appear on the (t-1)-th day, $B''_t$ is their number, and $n^{b''}_{t-1}$ is the number of occurrences of the $b''$-th such attribute; and $\exp(\cdot)$ is the exponential function with the natural constant as base.
Preferably, the step of obtaining the partition quality label of each time-sequence uniform partition comprises:

computing, for the h-th time-sequence uniform partition, the partition quality label

$$C_h=\sum_{t=1}^{T}\Delta A_t\sum_{b=1}^{B} n^{b}_{t}\,w_b$$

where $C_h$ is the partition quality label of the h-th time-sequence uniform partition; $t$ denotes the t-th day in the partition and $T$ the total number of days in it; $\Delta A_t$ denotes the attribute volatility of the t-th day relative to the previous day in that partition; $n^{b}_{t}$ denotes the number of occurrences of the b-th attribute on the t-th day in that partition; and $w_b$ denotes the attribute weight corresponding to the b-th attribute;
and calculating partition quality labels of all time sequence uniform partitions.
Preferably, the time-sequence data quality curve of the whole data warehouse is obtained from a monotonically increasing model constructed over each time-sequence uniform partition and its partition quality label.
Preferably, the big-data quality parameter is obtained by integrating the time-sequence data quality curve.
Preferably, the step of selecting the corresponding optimal data warehouse data according to the big data quality parameter to re-optimize the digital twin system and establish the digital twin model includes:
computing the big-data quality parameters of all relevant data warehouses over the full life cycle generated by the digital twin system, selecting the data warehouse data with the largest quality parameter, re-optimizing the digital twin system with it, and building the digital twin model.
The embodiment of the invention has the following beneficial effects:
1. The benefit of this application over the prior art is that the quality of the data is measured not from all the data but from characteristic attributes within it, so big-data quality can be measured with a large reduction in the amount of computation.
2. The benefit of this application over the prior art is that data measurement can use the data generated while the digital twin system is running, which yields higher accuracy than prior-art measurement. Because the method computes on time-sequence partitions of the data, it greatly weakens the influence of time on the data quality measurement, so that subsequent digital twin system optimization is given high-quality data support.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; other drawings can be derived from them by a person skilled in the art without inventive effort.
Fig. 1 is a flowchart illustrating a data quality measurement method based on big data according to an embodiment of the present invention.
Detailed Description
To further describe the technical means adopted by the invention to achieve its intended purpose and their effects, the specific implementation, structure, characteristics and effects of the data quality measurement method based on big data proposed by the invention are described in detail below with reference to the accompanying drawings and preferred embodiments. In the following description, different instances of "one embodiment" or "another embodiment" do not necessarily refer to the same embodiment. Furthermore, the particular features, structures or characteristics of one or more embodiments may be combined in any suitable manner.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The following specifically describes a specific scheme of the data quality measurement method based on big data provided by the invention with reference to the accompanying drawings.
Referring to fig. 1, a flowchart illustrating a data quality measurement method based on big data according to an embodiment of the present invention is shown, the method includes the following steps:
s001, acquiring an industrial data warehouse related to the digital twin body, and preprocessing data in the data warehouse.
Because big-data quality must be measured from the data in the warehouse, the industrial data warehouse stores the digital twin data generated by the digital twin system, and the stored data is structured. The data structure required in this embodiment is: statistical time + data entry. For example, after the log information of the digital twin is imported into the data warehouse, the generation time of a log record is its statistical time, and the normalized vocabulary of the record is its data entry. The log information of the digital twin comprises the log data generated by the digital twin system over the whole industrial life cycle of product design, process, production, operation and maintenance, and the digital twin data may come from any link of that life cycle; data generated in the industrial product design link, for example, is one kind of digital twin data. All acquired big data in the industrial data warehouse is processed into the above structure so that its quality can be measured later; this embodiment takes one data warehouse as the object of study.
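To make the "statistical time + data entry" structure concrete, a minimal sketch follows; the field names, dates and log texts are illustrative assumptions, not taken from the patent:

```python
from datetime import date

def make_entry(stat_time, text):
    """One warehouse record in the 'statistical time + data entry' structure."""
    return {"stat_time": stat_time, "entry": text}

# Hypothetical digital-twin log records imported into the warehouse.
warehouse = [
    make_entry(date(2022, 11, 1), "design: thread pitch updated"),
    make_entry(date(2022, 11, 1), "production: batch 17 started"),
    make_entry(date(2022, 11, 2), "maintenance: spindle vibration check"),
]

def daily_counts(entries):
    """Amount of data per statistical day, used later for partitioning."""
    counts = {}
    for e in entries:
        counts[e["stat_time"]] = counts.get(e["stat_time"], 0) + 1
    return counts
```

The per-day counts produced by `daily_counts` are exactly the data amounts the next step compares when choosing the partition length.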
S002, acquiring the data-volume differences corresponding to different time nodes, and uniformly partitioning the big data in time sequence according to them.
First, the big data in the industrial data warehouse is uniformly partitioned in time sequence so that each partition holds an equal amount of data, reducing the influence of data volume on the big data in the industrial data warehouse.
Data volume is an important factor in the quality measurement of big data: volumes of different sizes affect the measurement differently, and the larger the volume, the more the apparent quality of the big data is inflated. Since the data volumes at different consecutive time points differ, they strongly influence the quality measurement. The invention therefore partitions the big data in the whole industrial data warehouse using uniformly distributed time nodes, so that the amount of data in each interval is essentially the same and the time span of each interval is equal, reducing the influence of unequal data volumes on the subsequent quality measurement. Because the data in the warehouse is counted per day, the time node of the time sequence in the invention takes the day as its unit, and the data warehouse is partitioned with uniformly distributed time nodes. The specific process is as follows:
First, the data-size difference $D_1$ with one day as the time node is computed:

$$D_1=\frac{1}{H_1-1}\sum_{h_1=2}^{H_1}\left|n_{h_1}-n_{h_1-1}\right|$$

where $h_1$ denotes the $h_1$-th time node when one day is taken as a time node, $H_1$ is the total number of such nodes, $n_{h_1}$ is the amount of data in the data warehouse in the $h_1$-th time node, and $n_{h_1-1}$ is the amount of data in the data warehouse in the $(h_1-1)$-th time node.
The larger $D_1$, the larger the differences between the daily data volumes in the warehouse when one day is taken as the time node, and the more the subsequent data quality measurement is affected: the larger the data volume, the stronger the diversity of attributes and the higher the frequency of individual attributes; when the data volume is too small, attribute diversity is weaker and individual attributes occur less frequently. If the daily data volumes differ greatly, the occurrence frequencies of the attributes differ as well and the subsequent quality measurement of each interval becomes inaccurate, and vice versa.
Then the data-size difference $D_2$ with two days as the time node is computed:

$$D_2=\frac{1}{H_2-1}\sum_{h_2=2}^{H_2}\left|n_{h_2}-n_{h_2-1}\right|$$

where $h_2$ denotes the $h_2$-th time node when two days are taken as one time node, $H_2$ is the maximum number of such nodes, $n_{h_2}$ is the amount of data in the data warehouse in the $h_2$-th time node, and $n_{h_2-1}$ is the amount of data in the data warehouse in the $(h_2-1)$-th time node.
The larger $D_2$, the larger the differences between the data volumes in the warehouse when two days are taken as the time node, and the more the subsequent data quality measurement is affected, and vice versa.
In the same way, $D_T$ is computed for every $T\in[1,\Delta t]$, giving the set $\{D_1,D_2,\dots,D_T,\dots,D_{\Delta t}\}$, from which the $T$ whose $D_T$ is minimal is selected.
MAXT days are thus divided into H time-sequence uniform partitions in units of T days. "Time-sequence uniform partition" means partitioning all the data with the same number of days T as the time node; after partitioning by this node, the amount of data in each partition is essentially equal, hence the name.
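The partition-length selection above can be sketched as follows. Since the patent's formula images are not reproduced, this sketch assumes $D_T$ is the mean absolute difference of data amounts between adjacent $T$-day nodes; trailing days that do not fill a whole node are dropped:

```python
def node_sums(daily, T):
    """Group per-day data amounts into nodes of T days each (drop a partial tail)."""
    return [sum(daily[i:i + T]) for i in range(0, len(daily) - len(daily) % T, T)]

def size_difference(daily, T):
    """D_T: mean absolute difference of data amounts between adjacent T-day nodes."""
    nodes = node_sums(daily, T)
    if len(nodes) < 2:
        return float("inf")  # fewer than two nodes: no adjacent pairs to compare
    return sum(abs(nodes[i] - nodes[i - 1]) for i in range(1, len(nodes))) / (len(nodes) - 1)

def best_partition_length(daily, max_T):
    """Choose the T in [1, max_T] whose D_T is minimal."""
    return min(range(1, max_T + 1), key=lambda T: size_difference(daily, T))
```

For a daily series like `[10, 2, 9, 3, 10, 2]`, two-day nodes all sum to 12, so `best_partition_length` picks T = 2: the partitioning that equalizes the data amount per node.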
S003, acquiring data attributes in a data warehouse, calculating attribute weights of all the attributes, acquiring partition quality labels of each time sequence uniform partition according to the attribute weights, and acquiring time sequence data quality curves of the whole data warehouse data according to the partition quality labels.
The quality of big data is relative. Therefore the attributes of the digital twin data are extracted first; attribute weights are then computed from the attributes; the stability of the attributes within each time-sequence uniform partition is computed from the attribute weights and the occurrence counts of the attributes; the quality label of the big data in each partition is computed from the attribute stability, the occurrence counts of the different attributes, and the corresponding attribute weights; the quality curve of the big data in the industrial data warehouse is then obtained from the per-partition quality labels; and finally the quality of the big data is measured through this curve. The specific process is as follows:
(1) Data attributes in a data warehouse are obtained.
Attribute extraction refers to extracting keywords from the digital-twin-related log information in the data of the data warehouse. The specific process is: the digital-twin-related log information is labelled manually, and the attribute extraction of the digital-twin-related data is then completed in the data warehouse using a named-entity recognition technique, giving the set of digital-twin-related attributes:
A = {A_1, A_2, …, A_b, …, A_B}

where B denotes the total number of digital-twin-related attributes extracted from the data warehouse and A_b denotes the b-th attribute.
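A toy stand-in for this extraction step is sketched below. A hand-labelled vocabulary replaces the trained named-entity recognizer, and the attribute terms are illustrative assumptions, not from the patent:

```python
# Hand-labelled vocabulary standing in for a trained named-entity recognizer.
TWIN_ATTRIBUTES = ("thread pitch", "spindle vibration", "surface defect")

def extract_attributes(entries):
    """Return the attribute set A = [A_1, ..., A_B] found in the log texts,
    in first-seen order."""
    found = []
    for text in entries:
        for attr in TWIN_ATTRIBUTES:
            if attr in text and attr not in found:
                found.append(attr)
    return found
```

In a real deployment the vocabulary lookup would be replaced by the NER model's predictions over the manually labelled corpus; the downstream weight and volatility computations only need the resulting attribute list and counts.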
(2) And calculating attribute weights of all the attributes.
All digital-twin-related attributes in the data warehouse are extracted. Because the extracted attributes differ in how objectively and accurately they describe the physical entity, and those differences affect the quality of the big data in the warehouse relative to the digital twin differently, the objectivity and accuracy of the digital twin data of the different attributes must be quantified. This is done by computing attribute weights: taking the b-th attribute $A_b$ as an example, its attribute weight $w_b$ is

$$w_b=\frac{n_b}{\sum_{b=1}^{B} n_b}$$

where $n_b$ is the number of occurrences of attribute $A_b$ in the data warehouse and $\sum_{b=1}^{B} n_b$ is the total number of occurrences of all attributes.
The weight of an attribute in the data is quantified in this way because the more often an attribute occurs, the more objective and accurate it is shown to be: the larger $w_b$, the more objectively and accurately the twin attribute $A_b$ describes the relevant physical entity in the data warehouse, the higher its importance, and correspondingly the larger its weight value.
The attribute weights of all extracted attributes are computed in this way, yielding the B attribute weights corresponding to the B attributes.
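The weight formula is a simple normalization of occurrence counts; a minimal sketch (the counts are illustrative assumptions):

```python
def attribute_weights(occurrences):
    """w_b = n_b divided by the total occurrences of all extracted attributes."""
    total = sum(occurrences.values())
    return {attr: n / total for attr, n in occurrences.items()}

# Hypothetical occurrence counts for three extracted attributes.
w = attribute_weights({"thread pitch": 6, "surface defect": 3, "spindle vibration": 1})
```

By construction the weights sum to 1, so a frequently occurring attribute dominates the later weighted sums exactly as the text intends.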
(3) And obtaining the partition quality label of each time sequence uniform partition according to the attribute weight.
After the attribute weights of all attributes are obtained, each interval of the time-sequence uniform partition still contains big data counted on different days; the attributes in each piece of each day's big data are not necessarily the same, and different attributes have different weights. The partition label of each time-sequence uniform partition is therefore quantified from the attributes of all big data in that partition. Taking the h-th time-sequence uniform partition as an example, its partition quality label is computed as

$$C_h=\sum_{t=1}^{T}\Delta A_t\sum_{b=1}^{B} n^{b}_{t}\,w_b$$

where $C_h$ is the partition quality label of the h-th time-sequence uniform partition; $t$ denotes the t-th day in the partition, $t\in[1,T]$; $\Delta A_t$ denotes the attribute volatility of the t-th day relative to the previous day in the h-th time-sequence uniform partition; $n^{b}_{t}$ denotes the number of occurrences of the b-th attribute on the t-th day in that partition; and $w_b$ denotes the attribute weight corresponding to the b-th attribute.
Here $\Delta A_t$ is computed as

$$\Delta A_t=\exp\!\left(-\left(\sum_{b'=1}^{B'_t} n^{b'}_{t}\,w_{b'}+\sum_{b''=1}^{B''_t} n^{b''}_{t-1}\,w_{b''}\right)\right)$$

where $\Delta A_t$ denotes the attribute volatility of the t-th day relative to the previous day in the h-th time-sequence uniform partition; $b'$ indexes the attributes appearing on the t-th day but not on the (t-1)-th day, $B'_t$ is their number, and $n^{b'}_{t}$ is the number of occurrences of the $b'$-th such attribute; $w_{b'}$ is the attribute weight of attribute $A_{b'}$; $b''$ indexes the attributes not appearing on the t-th day but appearing on the (t-1)-th day, $B''_t$ is their number, and $n^{b''}_{t-1}$ is their number of occurrences;
In the formula, $\Delta A_t$ is computed from the fluctuation of the attributes of the data within the h-th time-sequence uniform partition. The quantification works as follows: the difference between the attributes appearing on consecutive days is obtained, together with the weights and counts of the differing attributes; the larger this difference, the stronger the fluctuation of the attributes of the data, and the stronger the fluctuation, the poorer the quality of the data corresponding to those attributes on the consecutive days. The negative exponential of e then inverts this: the larger the fluctuation of the data, the weaker the stability of its attributes after inversion, and vice versa. The label of the data in the current partition is then obtained by combining this with the weighted sum of all attribute occurrences in the h-th time-sequence uniform partition: the more attributes in the partition and the larger their corresponding weights, the better the data quality in the partition. The product of the two parts serves as the partition label $C_h$: the larger $C_h$, the better the quality of the data in the partition, and vice versa.
The partition quality labels of all H time-sequence uniform partitions are computed in this way. The larger a partition quality label, the more attributes with large weights the big data of the different days in that partition contains, and the more stably the attributes of the different days' big data appear, so the better, relatively, the quality of the big data in that partition. Thus the partition quality labels of all time-sequence uniform partitions are obtained.
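The volatility and label computations can be sketched as below. Since the patent's formula images are not reproduced, this follows the textual description: the volatility term is the negative exponential of the weighted counts of attributes that appeared or vanished between consecutive days, and the label accumulates that stability times the day's weighted attribute mass; the handling of the first day (which has no predecessor) is an assumption:

```python
import math

def attribute_volatility(today, yesterday, weights):
    """Delta-A_t: exp of minus the weighted counts of attributes that appeared
    on day t but not day t-1, plus those present on day t-1 but gone on day t."""
    appeared = sum(n * weights[a] for a, n in today.items() if a not in yesterday)
    vanished = sum(n * weights[a] for a, n in yesterday.items() if a not in today)
    return math.exp(-(appeared + vanished))

def partition_quality_label(days, weights):
    """C_h: over the days of one partition (each a {attribute: count} dict),
    sum Delta-A_t times the weighted attribute mass of day t; the first day
    has no predecessor and is skipped."""
    label = 0.0
    for t in range(1, len(days)):
        stability = attribute_volatility(days[t], days[t - 1], weights)
        mass = sum(n * weights[a] for a, n in days[t].items())
        label += stability * mass
    return label
```

When consecutive days share exactly the same attributes, the volatility term is exp(0) = 1, so the label reduces to the weighted attribute mass, which matches the intuition that stable partitions score highest.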
(4) And acquiring a time sequence data quality curve of the whole data warehouse data according to the partition quality label.
After the partition quality labels of all time-sequence uniform partitions are obtained, the time-sequence data quality curve of the whole data warehouse is computed from them:

$$f(h)=C_h\left(1-e^{-h}\right)$$

where $f(h)$ denotes the time-sequence data quality curve of the whole data warehouse; $C_h$ is the partition quality label of the h-th time-sequence uniform partition; $h$ denotes the h-th time-sequence uniform partition, $h\in[1,H]$; $H$ is the total number of time-sequence uniform partitions; and $e$ is the natural constant.
$f(h)$ represents the time-sequence data quality curve of the whole data warehouse. It results from fitting, over all time-sequence uniform partitions, the labels $C_h$ of the data in each partition against the time-based benefit of that data. The theoretical support is that the longer ago data was entered (the smaller the partition index h), the lower the benefit it provides to the current digital twin. For example, data about screw surface defects from ten years ago has little influence on the design of a screw in the same field today, so the benefit drawn from it is low.
For the data in an industrial data warehouse, big data from different times is of different importance to the digital twin (that is, the timeliness of the big data); in particular, the weaker the time effect, the smaller the benefit, although the data never becomes completely useless. The invention therefore constrains it with a function that increases monotonically and is bounded over an infinite period, i.e. the factor $1-e^{-h}$, so that the closer the generation time of the data is to the measurement time, the higher its benefit, and the further away, the lower. The product of this factor with the partition quality label represents the quality curve of the big data at different times.
A data quality curve for all data warehouses associated with the digital twin is obtained using the above-described approach.
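The curve construction can be sketched as follows. The bounded monotone factor is reconstructed as $1-e^{-h}$ from the surrounding description (the patent's formula image is not reproduced), so treat the exact form as an assumption:

```python
import math

def quality_curve(labels):
    """f(h) = C_h * (1 - e^{-h}) for h = 1..H: the bounded, monotonically
    increasing time factor rewards more recent partitions."""
    return [c * (1.0 - math.exp(-h)) for h, c in enumerate(labels, start=1)]
```

With equal labels, the curve rises with h and approaches the label value asymptotically, which matches the stated constraint: old data keeps some benefit but never as much as recent data.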
S004, obtaining large data quality parameters according to the time sequence data quality curve.
After the quality curve $f(h)$ of all big data in the industrial data warehouse is obtained, it is used to measure the quality of the big data. Specifically, the big-data quality parameter $D$ is computed as

$$D=\frac{1}{H}\int_{1}^{H} f(h)\,\mathrm{d}h$$

where $D$ is the big-data quality parameter of the data warehouse; $h$ denotes the h-th time-sequence uniform partition; $f(h)$ is the time-sequence data quality curve of the whole data warehouse; $\mathrm{d}h$ is the differential; and $H$ is the total number of time-sequence uniform partitions.
Formula logic: the benefit curve of the big data in the industrial data warehouse cannot by itself measure the quality of the big data clearly and intuitively, and differing statistics periods of the warehouse data would distort the measure. The quality of the big data is therefore measured by integrating the quality curve and then taking the average, which, viewed more intuitively, reduces the influence of the statistics period on the quality measure. The larger the quality parameter D, the better the quality of the big data in the data warehouse, and vice versa.
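The integrate-then-average step above can be sketched numerically. The discretization of f(h) onto a sample grid is an assumption for illustration; the trapezoidal rule is implemented directly so the sketch does not depend on any particular NumPy version:

```python
import numpy as np

def big_data_quality_parameter(f_values: np.ndarray, h_grid: np.ndarray) -> float:
    """D = (1/H) * integral of f(h) dh, per the formula logic in the text.

    f_values: samples of the fitted quality curve f(h) evaluated on h_grid.
    h_grid:   partition indices (or a finer grid) running up to H.
    Dividing by H averages the integral, reducing the influence of the
    statistics period on the quality measure.
    """
    H = float(h_grid[-1])
    dh = np.diff(h_grid)
    # trapezoidal rule: mean of adjacent samples times the step width
    integral = float(np.sum((f_values[1:] + f_values[:-1]) * 0.5 * dh))
    return integral / H
```

For a flat curve f(h) = c on [1, H], the integral is c·(H−1), so D ≈ c for large H, as expected of an average.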
S005, re-optimizing the digital twin system and establishing a digital twin model according to the data warehouse data with the largest big data quality parameter.
Over the full life cycle of the digital twin system's operation, after the big data quality parameters of all the data warehouse data are acquired, the data warehouse data with the largest big data quality parameter are selected to re-optimize the digital twin system and establish a digital twin model. The re-optimization can be realized by modeling and iteratively upgrading the digital twin with the selected data warehouse data, and by training and optimizing the industry mechanism model in the digital twin system; the optimization process itself is not a key point of protection of the invention and is not described in detail.
In summary, the invention can measure the quality of big data using the characteristic attributes in the data, and can provide high-quality data support for optimizing the digital twin system while greatly reducing the amount of calculation. Because the method is based on time-series partitioning of the data, it greatly weakens the influence of time on the data quality measurement and achieves higher accuracy.
It should be noted that the order of the embodiments of the present invention is for description only and does not indicate the relative merits of the embodiments. The foregoing describes specific embodiments of this specification; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In this specification, each embodiment is described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments.
The above embodiments are only for illustrating the technical solution of the present application, and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the scope of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims (8)

1. A data quality metric method based on big data, the method comprising:
s1: acquiring digital twin data in an industrial data warehouse;
S2: taking different time intervals as unit time nodes respectively, obtaining the data size difference corresponding to each kind of time node from the data volume differences between adjacent time nodes, and obtaining time-series uniform partitions according to the data size differences;
s3: acquiring data attributes in a data warehouse, and calculating attribute weights of all the attributes;
S4: obtaining the attribute volatility of each day relative to the previous day according to the number of occurrences of each attribute in each time-series uniform partition and the attribute weight corresponding to each attribute;
S5: obtaining labels of the data in each time-series uniform partition according to the number of occurrences of each attribute in the partition and the corresponding attribute weight; obtaining the partition quality label of each time-series uniform partition according to the labels and the attribute volatility of the data in the partition, and obtaining the time-series data quality curve of the whole data warehouse according to the partition quality labels;
s6: obtaining big data quality parameters according to the time sequence data quality curve;
s7: selecting corresponding optimal data warehouse data according to the big data quality parameters, re-optimizing the digital twin system and establishing a digital twin model;
the step of obtaining the data size difference corresponding to each time node and obtaining the time sequence uniform partition according to the data size difference comprises the following steps:
with one day as a time node, the calculation formula of the corresponding data size difference is as follows:
in the formula, h_1 denotes the h_1-th time node when one day is taken as a time node, and H_1 is the maximum number of such one-day time nodes; the two data-amount terms denote the amount of data in the data warehouse within the h_1-th time node and within the (h_1-1)-th time node, respectively;
the data size difference D_2 with two days as a time node is calculated in the same way, and so on, obtaining D_T for each T ∈ [1, ΔT], where MAXT denotes the number of days for which all the big data in the industrial data warehouse have existed;
when the data size difference D_T corresponding to taking T days as a time node is the smallest, the data in all the data warehouses are uniformly partitioned with every T days as one time node.
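The partition-selection steps of claim 1 can be sketched as follows. The exact aggregation formula for D_T is shown only as an image in the original, so taking the mean absolute difference of data volumes between adjacent T-day nodes is an assumption for illustration:

```python
def data_size_difference(daily_counts, T):
    """Aggregate daily record counts into T-day nodes and return the mean
    absolute difference between adjacent nodes (assumed form of D_T)."""
    nodes = [sum(daily_counts[i:i + T]) for i in range(0, len(daily_counts), T)]
    if len(nodes) < 2:
        return float("inf")  # too few nodes to compare
    diffs = [abs(a - b) for a, b in zip(nodes[1:], nodes[:-1])]
    return sum(diffs) / len(diffs)

def best_window(daily_counts, max_T):
    """Pick the T in [1, max_T] that minimises the data size difference,
    i.e. the window giving the most uniform time-series partitioning."""
    return min(range(1, max_T + 1),
               key=lambda T: data_size_difference(daily_counts, T))
```

For daily counts alternating 10, 0, 10, 0, …, a two-day window yields identical nodes (difference 0), so best_window selects T = 2.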
2. The big data based data quality metric method of claim 1, wherein the step of obtaining data attributes in the data warehouse comprises:
the attributes of the data related to the digital twin are extracted in the data warehouse using named entity recognition, yielding the set A of attributes related to the digital twin:
A = {A_1, A_2, …, A_b, …, A_B}

where B denotes the total number of digital-twin-related attributes extracted from the data warehouse, and A_b denotes the b-th attribute.
3. The big data based data quality metric method of claim 2, wherein the step of calculating the attribute weights of all attributes comprises:
for the b-th attribute A_b, its attribute weight w_b is the ratio of the number of occurrences of A_b in the data warehouse to the total number of attributes extracted from the digital twin data in the data warehouse;
and calculating attribute weights of all the attributes.
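The weight definition of claim 3 can be sketched directly. The reading of "total number of attributes extracted" as the total occurrence count is an assumption; if it instead means the number of distinct attributes B, divide by `len(counts)`:

```python
from collections import Counter

def attribute_weights(occurrences):
    """w_b = occurrences of A_b / total attribute occurrences.

    occurrences: iterable of attribute names as extracted (e.g. by named
    entity recognition) from the data warehouse.
    """
    counts = Counter(occurrences)
    total = sum(counts.values())
    return {attr: n / total for attr, n in counts.items()}
```

Under this reading the weights sum to 1, so they behave as relative frequencies of the extracted attributes.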
4. The method for measuring data quality based on big data according to claim 1, wherein the step of obtaining the attribute volatility of each day relative to the previous day according to the number of occurrences of each attribute in each time-series uniform partition and the corresponding attribute weight comprises:
the calculation formula of the attribute of the t day in the h time sequence uniform partition relative to the attribute fluctuation of the previous day is as follows:
where ΔA_t denotes the attribute volatility of the t-th day relative to the previous day in the h-th time-series uniform partition; B'_t is the number of attributes that appear on the t-th day but not on the (t-1)-th day, A_b' is the b'-th such attribute, with a corresponding number of occurrences; w_b is the attribute weight of the b-th attribute A_b; B''_t is the number of attributes that do not appear on the t-th day but appear on the (t-1)-th day, A_b'' is the b''-th such attribute, with a corresponding number of occurrences; exp() denotes an exponential function with the natural constant as its base.
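The volatility formula itself is an image in the original and is not reproduced, so the combination below is an assumption: the weighted counts of newly appearing and newly disappearing attributes are summed and squashed with a natural-exponential map, consistent with the symbols listed in claim 4:

```python
import math

def attribute_volatility(day_counts, prev_counts, weights):
    """Assumed form of ΔA_t: weighted counts of attributes present today
    but not yesterday, plus those present yesterday but not today, mapped
    through exp() so no change gives 0 and large change approaches 1.
    The exact combination in the patent formula is not reproduced here.
    """
    appeared = sum(weights.get(a, 0.0) * n
                   for a, n in day_counts.items() if a not in prev_counts)
    vanished = sum(weights.get(a, 0.0) * n
                   for a, n in prev_counts.items() if a not in day_counts)
    return 1.0 - math.exp(-(appeared + vanished))
```

Two identical days give zero volatility; disjoint attribute sets give a strictly positive value.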
5. The method for measuring data quality based on big data according to claim 1, wherein the step of obtaining the partition quality label of each time-series uniform partition comprises:
for the h-th time-series uniform partition, the calculation formula of its partition quality label is as follows:
wherein C is h The h time sequence is used for uniformly partitioning the partition quality labels; t represents the t-th day in the h-th time-series uniform partition; t represents the total number of days in the h time-series uniform partition; ΔA t Representing attribute volatility of the t-th day relative to the former day in the h-th time-series uniform partition;the number of the b-th attribute showing the occurrence of the t-th day in the h-th time-series uniform partition is +.>w b Representing attribute weight corresponding to the b-th attribute;
and calculating partition quality labels of all time sequence uniform partitions.
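The label formula of claim 5 is likewise an image in the original, so the aggregation below is an assumption: the weighted attribute mass of each day, discounted by that day's volatility, averaged over the T days of the partition:

```python
def partition_quality_label(daily_attr_counts, weights, volatilities):
    """Assumed form of C_h.

    daily_attr_counts: list over days t of {attribute: count} dicts.
    weights:           attribute weights w_b.
    volatilities:      list of ΔA_t per day (the first day may be 0).
    Each day's weighted attribute mass is discounted by (1 - ΔA_t),
    then the T daily values are averaged.
    """
    T = len(daily_attr_counts)
    total = 0.0
    for day, delta_a in zip(daily_attr_counts, volatilities):
        mass = sum(weights.get(a, 0.0) * n for a, n in day.items())
        total += mass * (1.0 - delta_a)
    return total / T
```

A stable partition (low ΔA_t) thus keeps a high label, while volatile partitions are penalised, matching the claim's use of both the attribute counts and the volatility.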
6. The big data based data quality metric method of claim 1, wherein the time-series data quality curve of the whole data warehouse is obtained from a monotonically increasing model constructed from each time-series uniform partition and its partition quality label.
7. The method of claim 1, wherein the big data quality parameter is obtained by integrating the time-series data quality curve.
8. The method of claim 1, wherein the steps of selecting the corresponding optimal data warehouse data according to the big data quality parameters, re-optimizing the digital twin system, and establishing the digital twin model comprise:
and calculating large data quality parameters in all relevant data warehouses in the whole life cycle generated by the digital twin system, selecting data warehouse data with the largest quality parameters to re-optimize the digital twin system, and establishing a digital twin model.
CN202211499047.8A 2022-11-28 2022-11-28 Data quality measurement method based on big data Active CN115718744B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211499047.8A CN115718744B (en) 2022-11-28 2022-11-28 Data quality measurement method based on big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211499047.8A CN115718744B (en) 2022-11-28 2022-11-28 Data quality measurement method based on big data

Publications (2)

Publication Number Publication Date
CN115718744A CN115718744A (en) 2023-02-28
CN115718744B true CN115718744B (en) 2023-07-21

Family

ID=85256651

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211499047.8A Active CN115718744B (en) 2022-11-28 2022-11-28 Data quality measurement method based on big data

Country Status (1)

Country Link
CN (1) CN115718744B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528284A (en) * 2022-02-18 2022-05-24 广东电网有限责任公司 Bottom layer data cleaning method and device, mobile terminal and storage medium
CN114691654A (en) * 2020-12-28 2022-07-01 丰田自动车株式会社 Data processing method and data processing system in energy Internet

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3379356A1 (en) * 2017-03-23 2018-09-26 ASML Netherlands B.V. Method of modelling lithographic systems for performing predictive maintenance
US11403541B2 (en) * 2019-02-14 2022-08-02 Rockwell Automation Technologies, Inc. AI extensions and intelligent model validation for an industrial digital twin
JP2022531919A (en) * 2019-05-06 2022-07-12 ストロング フォース アイオーティ ポートフォリオ 2016,エルエルシー A platform to accelerate the development of intelligence in the Internet of Things industrial system
JP2023500378A (en) * 2019-11-05 2023-01-05 ストロング フォース ヴィーシーエヌ ポートフォリオ 2019,エルエルシー Control tower and enterprise management platform for value chain networks
US20220059238A1 (en) * 2020-08-24 2022-02-24 GE Precision Healthcare LLC Systems and methods for generating data quality indices for patients
CN112367109B (en) * 2020-09-28 2022-02-01 西北工业大学 Incentive method for digital twin-driven federal learning in air-ground network
CN112699504B (en) * 2020-12-24 2023-05-05 北京理工大学 Assembly physical digital twin modeling method and device, electronic equipment and medium
CN113742431A (en) * 2021-08-13 2021-12-03 太原向明智控科技有限公司 Method and system for managing working surface measurement data
CN114548509A (en) * 2022-01-18 2022-05-27 湖南大学 Multi-type load joint prediction method and system for multi-energy system
CN114968984A (en) * 2022-06-11 2022-08-30 上海起杭数字科技有限公司 Digital twin full life cycle management platform

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114691654A (en) * 2020-12-28 2022-07-01 丰田自动车株式会社 Data processing method and data processing system in energy Internet
CN114528284A (en) * 2022-02-18 2022-05-24 广东电网有限责任公司 Bottom layer data cleaning method and device, mobile terminal and storage medium

Also Published As

Publication number Publication date
CN115718744A (en) 2023-02-28

Similar Documents

Publication Publication Date Title
Siami-Namini et al. The performance of LSTM and BiLSTM in forecasting time series
Siami-Namini et al. A comparative analysis of forecasting financial time series using arima, lstm, and bilstm
US8990145B2 (en) Probabilistic data mining model comparison
CN107563645A (en) A kind of Financial Risk Analysis method based on big data
CN113962294B (en) Multi-type event prediction model
WO2018133596A1 (en) Continuous feature construction method based on nominal attribute
CN113011191A (en) Knowledge joint extraction model training method
CN112819523A (en) Marketing prediction method combining inner/outer product feature interaction and Bayesian neural network
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN116861076A (en) Sequence recommendation method and device based on user popularity preference
CN115392477A (en) Skyline query cardinality estimation method and device based on deep learning
CN114662652A (en) Expert recommendation method based on multi-mode information learning
CN115718744B (en) Data quality measurement method based on big data
CN113435632A (en) Information generation method and device, electronic equipment and computer readable medium
Bidyuk et al. Methodology of Constructing Statistical Models for Nonlinear Non-stationary Processes in Medical Diagnostic Systems.
CN114820074A (en) Target user group prediction model construction method based on machine learning
CN115409541A (en) Cigarette brand data processing method based on data blood relationship
CN114357284A (en) Crowdsourcing task personalized recommendation method and system based on deep learning
CN112667394A (en) Computer resource utilization rate optimization method
WO2021077097A1 (en) Systems and methods for training generative models using summary statistics and other constraints
CN112488411A (en) Processing stability evaluation method based on approximate period process
CN112465054A (en) Multivariate time series data classification method based on FCN
US20230106295A1 (en) System and method for deriving a performance metric of an artificial intelligence (ai) model
CN114117251B (en) Intelligent context-Bo-down fusion multi-factor matrix decomposition personalized recommendation method
CN114386196B (en) Method for evaluating mechanical property prediction accuracy of plate strip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant