CN112231314A - Quality data evaluation method based on ETL - Google Patents
- Publication number
- CN112231314A (application CN202011225244.1A)
- Authority
- CN
- China
- Prior art keywords
- evaluation
- dominance
- random
- service provider
- dimension
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/48—Indexing scheme relating to G06F9/48
- G06F2209/484—Precedence
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Quality & Reliability (AREA)
- Complex Calculations (AREA)
Abstract
The invention provides an ETL-based quality data evaluation method. Batches of ETL tasks are allocated to executors using a greedy scheduling algorithm and added to a Quartz scheduler for timed execution; the abstracted Quartz task calls the remote executor interface, and when an executor receives the execution call for a new task it starts a local thread to run the ETL task. Task priority is calculated with a high-response-ratio priority scheduling algorithm, and the tasks allocated to an executor obtain thread resources to run according to that priority. ETL is used to predict and add service quality evaluation scale labels, which largely avoids the influence of subjective factors.
Description
[ technical field ]
The invention relates to the technical field of quality evaluation, in particular to an ETL-based quality data evaluation method.
[ background of the invention ]
ETL (extract, transform, load) is a basic technology for building data warehouses and for batch data exchange. It is the process of extracting, transforming, integrating, cleansing and loading data from a source to a target. The ETL process is an important part of building a data warehouse and can account for as much as 80% of the whole construction process. Over time, the number of ETL tasks in a data warehouse grows and the data volume keeps increasing. If the system runs on a single machine, task execution takes a long time, which severely limits the timeliness and availability of the data. Tasks are therefore allocated according to the data volume of each ETL task's data source, so that the load of the working nodes in the cluster is balanced and the total task execution time is as short as possible. Task priorities are adjusted dynamically with a high-response-ratio priority scheduling algorithm, so that tasks are executed fairly on the nodes.
With the development of quality evaluation technology, quality evaluation of service information has been widely applied in many fields. In medical imaging, for example, image fusion helps doctors analyse a patient's condition and improves the diagnosis rate; fusing infrared and visible-light images conveys information clearly under complex environmental conditions and can therefore be applied in the military field; multi-focus image fusion can effectively improve the imaging quality of a camera, improve scene recognition accuracy and eliminate redundant information between data. How to use ETL for effective objective quality evaluation, and how to combine it with subjective evaluation, has become an urgent problem.
[ summary of the invention ]
The invention provides a quality data evaluation method based on ETL.
The technical scheme adopted by the invention is as follows:
An ETL-based quality data evaluation method, comprising: step 1, obtaining quality evaluation data and preprocessing it, and, for evaluation indexes of different dimensions, calculating a probability distribution function, a cumulative distribution function and an expectation matrix of each service provider based on the obtained evaluation data; wherein batches of ETL tasks are allocated to executors using a greedy scheduling algorithm and added to a Quartz scheduler for timed execution, the abstracted Quartz task calls the remote executor interface, and an executor starts a local thread to run the ETL task when it receives the execution call for a new task; task priority is calculated using a high-response-ratio priority scheduling algorithm, and the tasks allocated to an executor obtain thread resources to run according to that priority;
the greedy allocation algorithm has the following specific steps:
step 101, sorting the task set $S$ in descending order of data volume and storing it in a queue $A = \{(s_i, w_i), \ldots, (s_j, w_j)\}$, where $w_i \ge w_j$ and $1 \le i, j \le n$;
step 102, taking the task at the head of queue $A$, calculating the total task data volume of each node, and allocating the task to node $u_i$, $1 \le i \le m$, where $u_i$ is the working node with the smallest total data volume at that moment;
step 103, repeating step 102 until the queue is empty, at which point the algorithm ends;
step 2, obtaining a random dominance relationship matrix between every two service providers of the same type under each evaluation dimension by using a random dominance criterion;
step 3, determining a random dominance degree matrix through a dominance degree function;
step 4, aggregating the random dominance degree matrices under each evaluation dimension by using entropy weights;
step 5, calculating the ranking values of each evaluation dimension and of the overall service quality.
Further, acquiring quality evaluation data and performing data preprocessing, specifically comprising:
step 101, obtaining quality evaluation data from an evaluation database;
step 102, clustering the obtained evaluation data offline according to structural similarity, analysing the similarity structures and deleting their irrelevant parts, and inputting the evaluation quantitative values of the quality evaluation data, calculated with a subtractive clustering algorithm, into a recurrent neural network model to predict and add service quality evaluation scale labels.
Further, the evaluation quantitative value of the evaluation data is calculated using the following formula based on a subtractive clustering algorithm:
where the quantity given by the formula is the evaluation quantitative value of the data point $z_k$ of the evaluation data, calculated at time $k$.
Further, calculating the probability distribution function, the cumulative distribution function and the expectation matrix of a service provider specifically comprises:
determining, for evaluation dimension $S_j$, the number of users who give service provider $H_i$ the service quality evaluation scale $T_\varepsilon$, i.e.
where the symbol in the formula indicates the evaluation scale used by the $x$-th user when evaluating the quality of service provider $H_i$ in evaluation dimension $S_j$, with $i \in I$, $j \in J$, $x \in \{1, 2, \ldots, N_{ij}\}$, $\varepsilon \in \{1, 2, \ldots, v\}$, $\eta \in \{1, 2, \ldots, v\}$;
then calculating the probability that service provider $H_i$ receives the service quality evaluation scale $T_\varepsilon$ in evaluation dimension $S_j$, i.e.
calculating by formula the cumulative distribution function $F_{ij}(t)$ of the evaluation scale $T_\varepsilon$ of service provider $H_i$ with respect to evaluation dimension $S_j$, i.e.
In addition, the expectation matrix of the evaluation scale of service provider $H_i$ for evaluation dimension $S_j$ is constructed, wherein
Further, obtaining the random dominance relationship matrix between every two service providers of the same type under each evaluation dimension by using the random dominance criterion specifically comprises:
when the random dominance criterion is used to evaluate the service quality of service providers, $F(x)$ and $G(x)$ are the cumulative distribution functions of the evaluation scale of service providers $H_i$ and $H_h$, respectively, for an evaluation dimension; when $F(x)$ randomly dominates $G(x)$, service provider $H_i$ randomly dominates $H_h$ in that evaluation dimension;
According to the above random dominance criterion, a random dominance relationship matrix between every two service providers is established for evaluation dimension $S_j$, each entry of which represents the random dominance relationship between service providers $H_i$ and $H_h$ for evaluation dimension $S_j$, i.e.
where the four possible values of an entry indicate, respectively, that for evaluation dimension $S_j$ service provider $H_i$ first-order randomly dominates $H_h$, $H_i$ second-order randomly dominates $H_h$, $H_i$ third-order randomly dominates $H_h$, or there is no random dominance relationship between $H_i$ and $H_h$.
Further, the random dominance criterion specifically includes:
assuming that $X$ and $Y$ are random variables on the interval $[a, b]$ with cumulative distribution functions $F(x)$ and $G(x)$, respectively, the random dominance criterion is expressed as follows:
first-order random dominance: $F(x)$ first-order randomly dominates $G(x)$, denoted $F(x)\ \mathrm{FSD}\ G(x)$, if and only if $F(x) \neq G(x)$ and $H_1(x) = F(x) - G(x) \le 0$ for all $x \in [a, b]$;
second-order random dominance: $F(x)$ second-order randomly dominates $G(x)$, denoted $F(x)\ \mathrm{SSD}\ G(x)$, if and only if $F(x) \neq G(x)$ and $H_2(x) = \int_a^x H_1(y)\,dy \le 0$ for all $x \in [a, b]$;
third-order random dominance: $F(x)$ third-order randomly dominates $G(x)$, denoted $F(x)\ \mathrm{TSD}\ G(x)$, if and only if $F(x) \neq G(x)$ and $H_3(x) = \int_a^x H_2(y)\,dy \le 0$ for all $x \in [a, b]$.
Further, determining the random dominance degree matrix through the dominance degree function specifically comprises:
based on the determined random dominance relationship matrix, the dominance degree function is used to further describe the degree of random dominance between service providers $H_i$ and $H_h$ for evaluation dimension $S_j$;
based on the obtained random dominance relationship matrix, the random dominance degree matrix is determined through the dominance degree function as follows:
where $a_j$ is the preference threshold for the evaluation dimension, which can be calculated from the differences between the expectations of the service providers for that evaluation dimension; the calculation formula is
where a larger value of the dominance degree indicates that, for evaluation dimension $S_j$, service provider $H_i$ is superior to $H_h$ to a higher degree, and a smaller value indicates a lower degree;
on this basis, the random dominance degree matrix between every two service providers for evaluation dimension $S_j$ is constructed, with $i, h \in I$, $i \neq h$, $j \in J$.
Further, aggregating the random dominance degree matrices under each evaluation dimension by using entropy weights specifically comprises:
determining the entropy weight of each evaluation dimension, wherein the information entropy is defined as:
where m is the number of evaluation targets.
Further, determining the entropy weight of the evaluation dimension specifically includes:
step 401, computing the output entropy, i.e.
Step 402, calculating the degree of difference of the evaluation dimensions, namely
Step 403, calculating the entropy weight of the evaluation dimension, i.e.
Finally, the dominance degree matrices under all evaluation dimensions are aggregated using the entropy weights to construct the overall dominance degree matrix of all service providers, with $i, h \in I$, $i \neq h$, $j \in J$; each entry is the overall degree to which service provider $H_i$ is superior to service provider $H_h$, a larger value representing a higher degree, and it is calculated by the formula
Further, calculating the ranking values of each evaluation dimension and of the overall service quality specifically comprises:
based on the priority matrix, two values are calculated for each service provider, from which the ranking value is then obtained; from the matrix for evaluation dimension $S_j$, these two values can be computed for each service provider, the calculation formula being
where the first value represents the overall credibility with which service provider $H_i$ is superior to the other investigated service providers for evaluation dimension $S_j$ (the larger it is, the higher the level of $H_i$ in evaluation dimension $S_j$), and the second value represents the overall credibility with which $H_i$ is inferior to the other investigated service providers for evaluation dimension $S_j$ (the smaller it is, the higher the level of $H_i$ in evaluation dimension $S_j$);
from these two values, the ranking value of service provider $H_i$ for evaluation dimension $S_j$ is computed by the following formula:
the obtained ranking values can be used to rank service providers of the same type in evaluation dimension $S_j$; from the overall matrix, the corresponding two values of service provider $H_i$ are computed, the calculation formula being
where the first value represents the overall credibility with which service provider $H_i$ is superior to the other service providers of the same type (the larger it is, the higher the overall service quality of $H_i$), and the second value represents the overall credibility with which $H_i$ is inferior to the other service providers of the same type (the smaller it is, the higher the overall service quality of $H_i$);
from these two values, the overall ranking value of service provider $H_i$ can be computed by the formula
The overall service quality of service providers of the same type can then be ranked according to the obtained ranking values.
The embodiments of the invention achieve the following technical effect: ETL is used to predict and add service quality evaluation scale labels, which largely avoids the influence of subjective factors.
[ description of the drawings ]
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an ETL-based cluster scheduling process according to the present invention;
FIG. 2 is a schematic flow chart of the greedy allocation algorithm of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Fig. 1 is a schematic diagram of an ETL-based cluster scheduling process according to the present invention. The ETL scheduling system consists of a scheduling module and an execution module. The scheduling module (scheduling center) manages scheduling information, abstracts all scheduled tasks into a single task type, issues scheduling requests according to the scheduling configuration, and carries no business code. Decoupling the scheduling system from the specific tasks improves the availability and stability of the system. The execution module (executor) receives scheduling requests and executes the task logic.
The scheduling process is shown in Fig. 1. Batches of ETL tasks are allocated to executors using a greedy scheduling algorithm and added to a Quartz scheduler for timed execution; the abstracted Quartz task calls the remote executor interface, and when an executor receives the execution call for a new task it starts a local thread to run the ETL task. Task priority is calculated with a high-response-ratio priority scheduling algorithm, and the tasks allocated to an executor obtain thread resources to run according to that priority.
A greedy algorithm is an approximation algorithm for finding an optimal solution. At each step it makes the choice that looks best at that moment; that is, it settles for a solution that is locally optimal in some sense rather than considering global optimality.
Suppose there is a task set $S = \{s_1, s_2, \ldots, s_n\}$ whose tasks have source data volumes $W = \{w_1, w_2, \ldots, w_n\}$, and $m$ working nodes $U = \{u_1, u_2, \ldots, u_m\}$. The algorithm distributes the tasks in $S$ evenly over the working nodes $U$ according to source data volume, seeking the allocation with the shortest total task execution time. It is assumed that the more evenly the total source data volume of the allocated tasks is spread across the machines, the more balanced the load on hardware resources and the shorter the total task execution time.
FIG. 2 is a schematic flow chart of the greedy allocation algorithm of the present invention. The greedy allocation algorithm has the following specific steps:
Step 101, sorting the task set $S$ in descending order of data volume and storing it in a queue $A = \{(s_i, w_i), \ldots, (s_j, w_j)\}$, where $w_i \ge w_j$ and $1 \le i, j \le n$.
Step 102, taking the task at the head of queue $A$, calculating the total task data volume of each node, and allocating the task to node $u_i$, $1 \le i \le m$, where $u_i$ is the working node with the smallest total data volume at that moment.
Step 103, repeating step 102 until the queue is empty, at which point the algorithm ends.
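Steps 101 to 103 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `greedy_allocate`, the task names and the data volumes are hypothetical, and a min-heap is used to find the least-loaded node, whereas the text only says that the totals are calculated.

```python
import heapq

def greedy_allocate(task_sizes, num_nodes):
    """Assign tasks to nodes: largest task first, always to the least-loaded node.

    task_sizes: dict mapping task name -> source data volume (w_i).
    num_nodes:  number of working nodes m.
    Returns (assignment, loads); assignment[k] lists the tasks given to node k.
    """
    # Step 101: sort tasks by data volume, largest first (queue A).
    queue = sorted(task_sizes.items(), key=lambda item: item[1], reverse=True)
    # Min-heap of (total data volume, node index); ties go to the lower index.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_nodes)]
    # Steps 102-103: give the head task to the least-loaded node until A is empty.
    for task, size in queue:
        load, node = heapq.heappop(heap)
        assignment[node].append(task)
        heapq.heappush(heap, (load + size, node))
    loads = [0] * num_nodes
    for load, node in heap:
        loads[node] = load
    return assignment, loads

# Hypothetical tasks with data volumes in arbitrary units.
tasks = {"s1": 7, "s2": 5, "s3": 4, "s4": 3, "s5": 1}
assignment, loads = greedy_allocate(tasks, 2)
print(assignment)  # [['s1', 's4'], ['s2', 's3', 's5']]
print(loads)       # [10, 10]
```

In this example both nodes end up with a total data volume of 10, which is the balanced outcome the algorithm aims for; in general the greedy choice gives an approximation rather than a guaranteed optimum.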
High-response-ratio priority scheduling is used in the ETL scheduling system to dynamically adjust the priority of ETL tasks. ETL jobs differ widely in running time; dynamically adjusting their priority gives every task an opportunity to obtain resources and run.
The algorithm is applied in the executor module of the scheduling framework, whose main design is shown in fig. 2. The executor continuously receives scheduling requests from the scheduling center and caches them in a priority queue; while thread group resources are sufficient, tasks in the priority queue are executed by threads directly. When thread group resources are insufficient, ETL tasks that have waited longer acquire a higher priority, and as soon as resources become idle the high-priority tasks get the chance to execute first. To prevent some tasks from occupying resources for so long that other tasks never get to run, the execution order is adjusted through the tasks' dynamic priority. This is a typical producer-consumer model: the priority queue is the buffer filled by the producer, the task monitoring thread is the consumer's agent (responsible for task allocation), and the threads of the thread group are the consumers.
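The text does not give the priority formula. The sketch below assumes the textbook highest-response-ratio-next definition, priority = (waiting time + estimated running time) / estimated running time; the class `EtlTask`, its fields and the example tasks are hypothetical, and the estimated running time is assumed to be available in advance.

```python
from dataclasses import dataclass

@dataclass
class EtlTask:
    name: str
    arrival_time: float       # when the scheduling request reached the executor
    estimated_runtime: float  # assumed to be known or estimated in advance

def response_ratio(task, now):
    """Textbook response ratio: (waiting time + service time) / service time."""
    waiting = now - task.arrival_time
    return (waiting + task.estimated_runtime) / task.estimated_runtime

def pick_next(pending, now):
    """Return the pending task with the highest response ratio at time `now`."""
    return max(pending, key=lambda task: response_ratio(task, now))

pending = [
    EtlTask("long_job", arrival_time=0.0, estimated_runtime=100.0),
    EtlTask("short_job", arrival_time=40.0, estimated_runtime=10.0),
]
print(pick_next(pending, now=41.0).name)  # long_job  (ratios 1.41 vs 1.1)
print(pick_next(pending, now=50.0).name)  # short_job (ratios 1.5 vs 2.0)
```

Because the ratio grows with waiting time, a long task that keeps being overtaken eventually reaches the highest priority, which is the fairness property the paragraph above describes.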
Step 1, obtaining quality evaluation data and preprocessing it, and, for evaluation indexes of different dimensions, calculating a probability distribution function, a cumulative distribution function and an expectation matrix of each service provider based on the obtained evaluation data;
in the above step, the obtaining of the quality evaluation data and the data preprocessing specifically include:
step 101, obtaining quality evaluation data from an evaluation database. The obtaining mode can comprise directly reading from a database, or capturing corresponding evaluation data from evaluation databases of different network platforms through a web crawler;
step 102, clustering the obtained evaluation data offline according to structural similarity, analysing the similarity structures and deleting their irrelevant parts, and inputting the evaluation quantitative values of the quality evaluation data, calculated with a subtractive clustering algorithm, into a recurrent neural network model to predict and add service quality evaluation scale labels.
Because the recurrent neural network model is constructed on the basis of similarity structures, the evaluation data is clustered using the similarity structure and certain keywords as features. Keywords that appear frequently in evaluation data and affect the analysis of the various targets, such as "original", "may", "but" and "lingo", are preset, and the evaluation data is clustered using subtractive clustering. Each evaluation data point is regarded as a candidate cluster center, and the evaluation quantitative value of each evaluation data point is calculated by the following formula:
where the computed quantity is the potential value of data point $z_i$, $N$ is the number of data points, and $r_a$ is a positive constant. The evaluation quantitative value of each evaluation data point is thus a function of its distances to all other data points. The point with the highest evaluation quantitative value is selected as the first cluster center. The evaluation quantitative values of all data points are then reduced by an amount that depends on their distance to that cluster center. The next cluster center is the data point with the largest remaining evaluation quantitative value, and so on. To cluster the evaluation data, the evaluation quantitative value is calculated using the following formula based on a subtractive clustering algorithm:
where the quantity given by the formula is the evaluation quantitative value of the data point $z_k$ of the evaluation data, calculated at time $k$.
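The formulas themselves are missing from the text. As an illustration only, the sketch below uses the commonly cited batch form of subtractive clustering, in which the potential of a point is a sum of terms exp(-4 d^2 / r_a^2) over all points and potentials are reduced around each chosen center with a second radius r_b. The function name, the choice r_b = 1.5 r_a and the sample points are assumptions, and the time-indexed calculation at time $k$ is not reproduced.

```python
import math

def subtractive_clustering(points, r_a=1.0, r_b=1.5, num_centers=2):
    """Pick cluster centers by repeatedly taking the point of highest potential."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    alpha = 4.0 / r_a ** 2  # neighbourhood radius used for the potential
    beta = 4.0 / r_b ** 2   # larger radius used when reducing potentials
    # Potential of each point: contributions decay with distance to every point.
    potential = [sum(math.exp(-alpha * sq_dist(p, q)) for q in points)
                 for p in points]
    centers = []
    for _ in range(num_centers):
        best = max(range(len(points)), key=lambda i: potential[i])
        best_potential = potential[best]
        centers.append(points[best])
        # Reduce every potential according to its distance to the new center.
        potential = [
            value - best_potential * math.exp(-beta * sq_dist(p, points[best]))
            for p, value in zip(points, potential)
        ]
    return centers

# Hypothetical two-dimensional feature vectors of evaluation data.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
          (5.0, 5.0), (5.1, 5.0), (4.9, 5.0)]
centers = subtractive_clustering(points)
print(centers)
```

With these sample points the densest point of each group is selected: first the point at the origin, then the point at (5.0, 5.0).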
Calculating the probability distribution function, the cumulative distribution function and the expectation matrix of a service provider specifically comprises:
determining, for evaluation dimension $S_j$, the number of users who give service provider $H_i$ the service quality evaluation scale $T_\varepsilon$, i.e.
where the symbol in the formula indicates the evaluation scale used by the $x$-th user when evaluating the quality of service provider $H_i$ in evaluation dimension $S_j$, with $i \in I$, $j \in J$, $x \in \{1, 2, \ldots, N_{ij}\}$, $\varepsilon \in \{1, 2, \ldots, v\}$, $\eta \in \{1, 2, \ldots, v\}$.
then calculating the probability that service provider $H_i$ receives the service quality evaluation scale $T_\varepsilon$ in evaluation dimension $S_j$, i.e.
calculating by formula the cumulative distribution function $F_{ij}(t)$ of the evaluation scale $T_\varepsilon$ of service provider $H_i$ with respect to evaluation dimension $S_j$, i.e.
In addition, the expectation matrix of the evaluation scale of service provider $H_i$ for evaluation dimension $S_j$ is constructed, wherein
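The formula images for these quantities are not present in the text. A minimal sketch, assuming that the probability is the empirical frequency of each scale value among the $N_{ij}$ users, that the cumulative distribution is the running sum of those frequencies, and that each entry of the expectation matrix is the mean scale value for one provider and one dimension; the function name and the sample ratings are hypothetical.

```python
from collections import Counter

def rating_distributions(ratings, scale_values):
    """Empirical distribution of one provider's ratings in one evaluation dimension.

    ratings:      scale values chosen by the N_ij users.
    scale_values: ordered list of the v possible scale values T_1 < ... < T_v.
    Returns (pmf, cdf, expectation); pmf and cdf are indexed like scale_values.
    """
    counts = Counter(ratings)  # number of users per scale value
    total = len(ratings)
    pmf = [counts[t] / total for t in scale_values]
    cdf, running = [], 0.0
    for p in pmf:
        running += p
        cdf.append(running)
    expectation = sum(t * p for t, p in zip(scale_values, pmf))
    return pmf, cdf, expectation

# Eight hypothetical ratings on a five-point scale.
pmf, cdf, mean = rating_distributions([5, 4, 4, 3, 5, 5, 2, 4], [1, 2, 3, 4, 5])
print(pmf)   # [0.0, 0.125, 0.125, 0.375, 0.375]
print(cdf)   # [0.0, 0.125, 0.25, 0.625, 1.0]
print(mean)  # 4.0
```

Calling the function once per provider and dimension and collecting the returned expectations would give a table with one row per service provider and one column per evaluation dimension.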
Step 2, obtaining a random dominance relationship matrix between every two service providers of the same type under each evaluation dimension by using a random dominance criterion;
assuming that $X$ and $Y$ are random variables on the interval $[a, b]$ with cumulative distribution functions $F(x)$ and $G(x)$, respectively, the random dominance criterion is expressed as follows:
first-order random dominance: $F(x)$ first-order randomly dominates $G(x)$, denoted $F(x)\ \mathrm{FSD}\ G(x)$, if and only if $F(x) \neq G(x)$ and $H_1(x) = F(x) - G(x) \le 0$ for all $x \in [a, b]$;
second-order random dominance: $F(x)$ second-order randomly dominates $G(x)$, denoted $F(x)\ \mathrm{SSD}\ G(x)$, if and only if $F(x) \neq G(x)$ and $H_2(x) = \int_a^x H_1(y)\,dy \le 0$ for all $x \in [a, b]$;
third-order random dominance: $F(x)$ third-order randomly dominates $G(x)$, denoted $F(x)\ \mathrm{TSD}\ G(x)$, if and only if $F(x) \neq G(x)$ and $H_3(x) = \int_a^x H_2(y)\,dy \le 0$ for all $x \in [a, b]$;
when the random dominance criterion is used to evaluate the service quality of service providers, $F(x)$ and $G(x)$ are the cumulative distribution functions of the evaluation scale of service providers $H_i$ and $H_h$, respectively, for an evaluation dimension; when $F(x)$ randomly dominates $G(x)$, service provider $H_i$ randomly dominates $H_h$ in that evaluation dimension.
According to this random dominance criterion, a random dominance relationship matrix between every two service providers can be established for evaluation dimension $S_j$, each entry of which represents the random dominance relationship between service providers $H_i$ and $H_h$ for evaluation dimension $S_j$, i.e.
where the four possible values of an entry indicate, respectively, that for evaluation dimension $S_j$ service provider $H_i$ first-order randomly dominates $H_h$, $H_i$ second-order randomly dominates $H_h$, $H_i$ third-order randomly dominates $H_h$, or there is no random dominance relationship between $H_i$ and $H_h$.
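The pairwise check can be sketched for discrete rating scales as follows. Random dominance is more commonly called stochastic dominance; the sketch assumes both cumulative distributions are given on the same equally spaced scale and approximates the integrals by running sums over the scale points. The encoding of the relationship as 1, 2, 3 or 0 and the function names are illustrative choices, not the notation of the patent.

```python
from itertools import accumulate

def dominance_order(cdf_f, cdf_g, tol=1e-12):
    """Return 1, 2 or 3 if F dominates G at that order, or 0 if no relation is found.

    cdf_f, cdf_g: cumulative probabilities on the same equally spaced rating scale.
    Integrals are approximated by cumulative sums over the scale points.
    """
    if all(abs(f - g) <= tol for f, g in zip(cdf_f, cdf_g)):
        return 0  # identical distributions: no dominance
    h = [f - g for f, g in zip(cdf_f, cdf_g)]  # H1(x) = F(x) - G(x)
    for order in (1, 2, 3):
        if all(value <= tol for value in h):
            return order
        h = list(accumulate(h))  # next H as the running sum of the previous one
    return 0

def relation_matrix(cdfs):
    """Entry [i][h] is the order at which provider i dominates provider h (0 if none)."""
    n = len(cdfs)
    return [[0 if i == h else dominance_order(cdfs[i], cdfs[h]) for h in range(n)]
            for i in range(n)]

# Hypothetical cumulative distributions on a four-point scale.
a = [0.0, 0.25, 0.5, 1.0]
b = [0.25, 0.5, 0.75, 1.0]
print(relation_matrix([a, b]))                                     # [[0, 1], [0, 0]]
print(dominance_order([0.0, 0.5, 0.5, 1.0], [0.25, 0.25, 0.75, 1.0]))  # 2
```

A lower cumulative curve means more probability mass on higher ratings, which is why `a` first-order dominates `b` in the example while the reverse relation does not hold at any order.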
Step 3, determining a random dominance degree matrix through a dominance degree function;
Based on the determined random dominance relationship matrix, the dominance degree function is used to further describe the degree of random dominance between service providers $H_i$ and $H_h$ for evaluation dimension $S_j$. Based on the obtained random dominance relationship matrix, the random dominance degree matrix is determined through the dominance degree function as follows:
where $a_j$ is the preference threshold for the evaluation dimension, which can be calculated from the differences between the expectations of the service providers for that evaluation dimension; the calculation formula is
where a larger value of the dominance degree indicates that, for evaluation dimension $S_j$, service provider $H_i$ is superior to $H_h$ to a higher degree, and a smaller value indicates a lower degree.
On this basis, the random dominance degree matrix between every two service providers for evaluation dimension $S_j$ can be constructed, with $i, h \in I$, $i \neq h$, $j \in J$.
Step 4, aggregating the random dominance degree matrices under each evaluation dimension by using entropy weights;
First, the entropy weight of each evaluation dimension is determined. Information entropy is a measure of the degree of disorder of a system; it can avoid the influence of subjective factors to a certain degree and solves the problem of quantifying information. It is defined as:
where $m$ is the number of evaluation targets. Entropy uses probability theory to measure the uncertainty of information: the more scattered the data, the greater the uncertainty. In general, the greater the variation of an index's values, the smaller its information entropy, the more information the index provides, and the larger its weight; conversely, the smaller the weight. The weight of each evaluation dimension, its "entropy weight", can therefore be calculated from the information entropy in three steps.
Step 401, computing the output entropy, i.e.
Step 402, calculating the degree of difference of the evaluation dimensions, namely
Step 403, calculating the entropy weight of the evaluation dimension, i.e.
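The three formulas are not reproduced in the text. The sketch below follows the standard entropy weight method as an assumption: scores are normalised per dimension into proportions, the entropy is scaled by 1 / ln(m), the degree of difference is one minus the entropy, and the weights are the normalised differences. Which quantity the patent feeds into the entropy is not recoverable, so the input table `scores` (one non-negative score per evaluation target and dimension) is hypothetical.

```python
import math

def entropy_weights(scores):
    """Entropy weights for the columns (evaluation dimensions) of a score table.

    scores[i][j]: non-negative score of evaluation target i in dimension j.
    At least one dimension must differ between targets, otherwise all
    differences are zero and the weights are undefined.
    """
    m = len(scores)     # number of evaluation targets
    n = len(scores[0])  # number of evaluation dimensions
    k = 1.0 / math.log(m)
    entropies = []
    for j in range(n):
        column = [scores[i][j] for i in range(m)]
        total = sum(column)
        probs = [value / total for value in column]
        # Step 401: output entropy, with 0 * ln(0) taken as 0.
        entropies.append(-k * sum(p * math.log(p) for p in probs if p > 0))
    # Step 402: degree of difference of each dimension.
    differences = [1.0 - e for e in entropies]
    # Step 403: normalise the differences into entropy weights.
    total_difference = sum(differences)
    return [d / total_difference for d in differences]

# Three targets, two dimensions: the first dimension does not discriminate at all.
weights = entropy_weights([[4.0, 1.0], [4.0, 2.0], [4.0, 5.0]])
print(weights)
```

In the example the first dimension has identical scores for every target, so its entropy is maximal and almost all of the weight goes to the second dimension, matching the remark that a dimension with more variation carries more information.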
Finally, the dominance degree matrices under all evaluation dimensions are aggregated using the entropy weights to construct the overall dominance degree matrix of all service providers, with $i, h \in I$, $i \neq h$, $j \in J$. Each entry is the overall degree to which service provider $H_i$ is superior to service provider $H_h$, a larger value representing a higher degree, and it is calculated by the formula
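The aggregation formula is missing from the text; the natural reading, assumed here, is an entropy-weighted sum of the per-dimension dominance degree matrices. The function name, the dimension names and the numbers are hypothetical.

```python
def aggregate_dominance(matrices, weights):
    """Weighted sum of per-dimension dominance degree matrices.

    matrices[j][i][h]: dominance degree of provider i over provider h in dimension j.
    weights[j]:        entropy weight of dimension j (weights sum to 1).
    """
    size = len(matrices[0])
    return [
        [sum(w * matrix[i][h] for w, matrix in zip(weights, matrices))
         for h in range(size)]
        for i in range(size)
    ]

# Two providers, two hypothetical dimensions with equal weights.
d_price = [[0.0, 0.5], [0.0, 0.0]]
d_speed = [[0.0, 0.0], [0.25, 0.0]]
overall = aggregate_dominance([d_price, d_speed], [0.5, 0.5])
print(overall)  # [[0.0, 0.25], [0.125, 0.0]]
```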
Step 5, calculating the ranking value of each evaluation dimension and the overall service quality
From the priority matrix, two quantities can be calculated for each service provider, and the ranking value can then be derived from them. From the matrix for evaluation dimension S_j, these two quantities can be calculated for service provider H_i; the calculation formula is
where the first quantity represents the overall credibility with which service provider H_i is superior to the other investigated service providers with respect to evaluation dimension S_j: the larger it is, the higher the level of H_i on S_j. The second quantity represents the overall credibility with which H_i is inferior to the other investigated service providers with respect to S_j: the smaller it is, the higher the level of H_i on S_j.
From these two quantities, the ranking value of service provider H_i for evaluation dimension S_j is calculated as follows:
The resulting ranking values can be used to rank service providers of the same type on evaluation dimension S_j. Similarly, from the overall dominance degree matrix, the two corresponding quantities can be calculated for service provider H_i; the calculation formula is
where the first quantity represents the overall credibility with which service provider H_i is superior to the other service providers of the same type: the larger it is, the higher the overall quality of service of H_i. The second quantity represents the overall credibility with which H_i is inferior to the other service providers of the same type: the smaller it is, the higher the overall quality of service of H_i.
From these two quantities, the overall ranking value of service provider H_i can be calculated by the formula
The service quality of service providers of the same type can then be ranked overall according to the resulting ranking values.
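The ranking formulas themselves are not legible in this text. The sketch below assumes a common outranking-style form: the "superior" quantity is the row sum of the dominance degree matrix, the "inferior" quantity is the column sum, and the ranking value is their difference. This form and the names `ranking_values` and `rank_providers` are assumptions for illustration only; the same functions apply to a per-dimension matrix or to the overall matrix.

```python
def ranking_values(dominance):
    """Return one ranking value per service provider.

    dominance[i][h] is the degree to which H_i dominates H_h (diagonal ignored).
    """
    k = len(dominance)
    values = []
    for i in range(k):
        # Assumed "superior" quantity: how strongly H_i dominates the others.
        superior = sum(dominance[i][h] for h in range(k) if h != i)
        # Assumed "inferior" quantity: how strongly the others dominate H_i.
        inferior = sum(dominance[h][i] for h in range(k) if h != i)
        values.append(superior - inferior)
    return values


def rank_providers(dominance):
    """Return provider indices ordered from best to worst ranking value."""
    values = ranking_values(dominance)
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)
```

Under this assumed form, a larger "superior" quantity and a smaller "inferior" quantity both raise the ranking value, consistent with the description above.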
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A quality data evaluation method based on ETL, characterized by comprising:
step 1, obtaining quality evaluation data and preprocessing the data, and, for evaluation indexes of different dimensions, calculating a probability distribution function, a cumulative distribution function and an expectation matrix of each service provider based on the obtained evaluation data; wherein ETL tasks are distributed to executors in batches using a greedy scheduling algorithm, the tasks are added to a Quartz scheduler for timed execution, the abstract Quartz task calls a remote executor interface, and when an executor receives the execution call for a new task, it starts a local thread to run the ETL task;
the task priority is calculated using a high-response-ratio priority scheduling algorithm, and the tasks distributed to an executor acquire thread resources to run the ETL (extract, transform, load) task in order of priority;
the greedy allocation algorithm comprises the following specific steps:
step 101, sorting the task set S in descending order of data volume and storing it in a queue A;
A = {(si, wi), …, (sj, wj)}, where wi ≥ wj for i ≤ j, and 1 ≤ i, j ≤ n;
step 102, taking the task at the head of queue A, calculating the total data volume of the tasks on each node, and assigning the task to node ui, where 1 ≤ i ≤ m;
ui is the node with the smallest total data volume among the current working nodes;
step 103, repeating step 102 until the queue is empty, whereupon the algorithm ends;
step 2, obtaining a random dominance relationship matrix between every two service providers of the same type under each evaluation dimension by using a random dominance criterion;
step 3, determining a random dominance degree matrix through a dominance degree function;
step 4, aggregating the random dominance degree matrices under each evaluation dimension by using entropy weights;
step 5, calculating the ranking values of each evaluation dimension and of the overall service quality.
2. The quality evaluation method according to claim 1, wherein the acquiring of quality evaluation data and the data preprocessing specifically comprise:
step 101, obtaining quality evaluation data from an evaluation database;
step 102, performing offline clustering on the acquired evaluation data according to structural similarity, analyzing the similar structures and deleting their irrelevant parts, and inputting the evaluation quantitative value of the quality evaluation data, calculated based on a subtractive clustering algorithm, into a recurrent neural network model to predict and attach a service quality evaluation scale label.
3. The quality evaluation method according to claim 2, wherein the evaluation quantitative value of the evaluation data is calculated using the following formula based on a subtractive clustering algorithm:
4. The quality evaluation method according to one of claims 1 to 3, wherein the calculating a probability distribution function, a cumulative distribution function, and an expectation matrix of the service provider specifically comprises:
determining, under evaluation dimension S_j, the number of users who rate the service quality of service provider H_i at evaluation scale T_ε, i.e.
where the evaluation scale in question is
the one used by the x-th user when evaluating the quality of service provider H_i on evaluation dimension S_j, with i ∈ I, j ∈ J, x ∈ {1, 2, …, N_ij}, ε ∈ {1, 2, …, v}, η ∈ {1, 2, …, v};
then calculating the probability that, under evaluation dimension S_j, the service quality of service provider H_i is rated at evaluation scale T_ε, i.e.
calculating, by the following formula, the cumulative distribution function F_ij(t) of the evaluation scale T_ε of service provider H_i with respect to evaluation dimension S_j, i.e.
in addition, constructing the expectation matrix of the evaluation scales of service providers H_i for evaluation dimensions S_j, where
5. The quality evaluation method according to one of claims 1 to 4, wherein a random dominance criterion is used to obtain a random dominance relationship matrix between two service providers of the same type in each evaluation dimension, and the method specifically comprises:
when the random dominance criterion is used to assess the quality of service of service providers, F(x) and G(x) are the cumulative distribution functions of the evaluation scale of service providers H_i and H_h, respectively, for an evaluation dimension; then, when F(x) randomly dominates G(x), service provider H_i randomly dominates H_h on that evaluation dimension;
establishing, according to the above random dominance criterion, the random dominance relationship matrix between every two service providers for evaluation dimension S_j, each element of which represents the random dominance relationship between service providers H_i and H_h for evaluation dimension S_j, i.e.
where the four possible values indicate, respectively, that for evaluation dimension S_j service provider H_i first-order randomly dominates H_h, that H_i second-order randomly dominates H_h, that H_i third-order randomly dominates H_h, or that there is no random dominance relationship between H_i and H_h.
6. The quality evaluation method according to one of claims 1 to 5, wherein the random dominance criterion specifically comprises:
assuming that X and Y are random variables on the interval [a, b] with cumulative distribution functions F(x) and G(x), respectively, the random dominance criterion is expressed as follows:
first-order random dominance: if and only if F(x) ≠ G(x) and H1(x) = F(x) - G(x) ≤ 0 for x ∈ [a, b], F(x) is said to have first-order random dominance over G(x), denoted F(x) FSD G(x);
second-order random dominance: if and only if F(x) ≠ G(x), and, x ∈ [a, b], F(x) is said to have second-order random dominance over G(x), denoted F(x) SSD G(x);
7. The quality evaluation method according to one of claims 1 to 6, wherein the determining of the random dominance degree matrix by the dominance degree function specifically comprises:
based on the determined random dominance relationship matrix, the dominance degree function can be used to further describe the degree of random dominance between service providers H_i and H_h for evaluation dimension S_j;
based on the obtained random dominance relationship matrix, the random dominance degree matrix is determined through the dominance degree function as follows:
where a_j is the preference threshold for evaluation dimension S_j, which can be calculated from the differences between the service providers' expected values for that evaluation dimension; the calculation formula is
where
a greater value of the dominance degree indicates that, for evaluation dimension S_j, service provider H_i is superior to H_h to a higher degree, and a smaller value indicates a lower degree;
8. The quality evaluation method according to one of claims 1 to 7, wherein the aggregating of the random dominance degree matrices under each evaluation dimension by using entropy weights specifically comprises:
determining an entropy weight of each evaluation dimension, wherein the information entropy is defined as:
where m is the number of evaluation targets.
9. The quality evaluation method according to claim 8, wherein determining the entropy weight of the evaluation dimension specifically comprises:
step 401, computing the output entropy, i.e.
Step 402, calculating the degree of difference of the evaluation dimensions, namely
Step 403, calculating the entropy weight of the evaluation dimension, i.e.
finally, aggregating the dominance degree matrices under all evaluation dimensions by using the entropy weights to construct the overall dominance degree matrix of all service providers, where i, h ∈ I, i ≠ h, j ∈ J; each element of this matrix is the overall degree to which service provider H_i is superior to service provider H_h, a larger value representing a higher degree, and is calculated by the formula
10. The quality evaluation method according to one of claims 1 to 9, wherein the calculating of the ranking values of each evaluation dimension and the overall quality of service specifically comprises:
calculating two quantities for each service provider based on the priority matrix, and then obtaining the ranking value from them; from the matrix for evaluation dimension S_j, these two quantities can be calculated for service provider H_i, and the calculation formula is
where the first quantity represents the overall credibility with which service provider H_i is superior to the other investigated service providers with respect to evaluation dimension S_j, a larger value representing a higher level of H_i on S_j, and the second quantity represents the overall credibility with which H_i is inferior to the other investigated service providers with respect to S_j, a smaller value representing a higher level of H_i on S_j;
calculating, from these two quantities, the ranking value of service provider H_i for evaluation dimension S_j, the calculation formula being as follows:
the resulting ranking values can be used to rank service providers of the same type on evaluation dimension S_j; from the overall dominance degree matrix, the two corresponding quantities are calculated for service provider H_i, and the calculation formula is
where the first quantity represents the overall credibility with which service provider H_i is superior to the other service providers of the same type, a larger value representing a higher overall quality of service of H_i, and the second quantity represents the overall credibility with which H_i is inferior to the other service providers of the same type, a smaller value representing a higher overall quality of service of H_i;
calculating, from these two quantities, the overall ranking value of service provider H_i by the formula
the service quality of service providers of the same type can then be ranked overall according to the resulting ranking values.
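The greedy allocation in steps 101 to 103 of claim 1 can be sketched in Python as below. The function names, the `(task_id, data_volume)` tuple layout and the use of a heap to find the least-loaded node are illustrative assumptions. The claims name a high-response-ratio priority algorithm without giving its formula, so `response_ratio` uses the textbook definition (waiting time plus estimated run time, divided by estimated run time), which is likewise an assumption.

```python
import heapq


def greedy_allocate(tasks, node_count):
    """Assign tasks to nodes, largest data volume first, least-loaded node first.

    tasks is a list of (task_id, data_volume) pairs; returns {node: [task_id, ...]}.
    """
    # Step 101: queue A holds the tasks in descending order of data volume.
    queue = sorted(tasks, key=lambda task: task[1], reverse=True)

    # Min-heap of (total data volume, node index), so the node with the
    # smallest current total is always at the top.
    loads = [(0, node) for node in range(node_count)]
    heapq.heapify(loads)
    assignment = {node: [] for node in range(node_count)}

    # Steps 102 and 103: take the head of the queue, give it to the node with
    # the smallest total data volume, and repeat until the queue is empty.
    for task_id, volume in queue:
        total, node = heapq.heappop(loads)
        assignment[node].append(task_id)
        heapq.heappush(loads, (total + volume, node))
    return assignment


def response_ratio(waiting_time, estimated_run_time):
    """Textbook response ratio: grows as a task waits, favouring short tasks."""
    return (waiting_time + estimated_run_time) / estimated_run_time
```

For example, four tasks with data volumes 5, 9, 3 and 7 spread over two nodes end up as two groups of total volume 12 each. On an executor, the waiting task with the highest response ratio would be the next to receive a thread.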
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011225244.1A CN112231314A (en) | 2020-11-05 | 2020-11-05 | Quality data evaluation method based on ETL |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112231314A true CN112231314A (en) | 2021-01-15 |
Family
ID=74122173
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011225244.1A Pending CN112231314A (en) | 2020-11-05 | 2020-11-05 | Quality data evaluation method based on ETL |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112231314A (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101046820A (en) * | 2006-03-29 | 2007-10-03 | 国际商业机器公司 | System and method for prioritizing websites during a webcrawling process |
CN103092683A (en) * | 2011-11-07 | 2013-05-08 | Sap股份公司 | Scheduling used for analyzing data and based on elicitation method |
US20150142725A1 (en) * | 2006-12-19 | 2015-05-21 | Teradata Corporation | High-throughput extract-transform-load (etl) of program events for subsequent analysis |
CN107944702A (en) * | 2017-11-23 | 2018-04-20 | 绥化学院 | A kind of network security step analysis appraisal procedure, device and computer-readable recording medium |
US20180210931A1 (en) * | 2017-01-20 | 2018-07-26 | Bank Of America Corporation | System for analyzing the runtime impact of data files on data extraction, transformation, and loading jobs |
CN109101632A (en) * | 2018-08-15 | 2018-12-28 | 中国人民解放军海军航空大学 | Product quality abnormal data retrospective analysis method based on manufacture big data |
CN109800954A (en) * | 2018-12-19 | 2019-05-24 | 中国石油化工股份有限公司 | Evaluating reservoir new method based on log data |
CN110287245A (en) * | 2019-05-15 | 2019-09-27 | 北方工业大学 | Method and system for scheduling and executing distributed ETL (extract transform load) tasks |
CN110414751A (en) * | 2018-04-26 | 2019-11-05 | 观相科技(上海)有限公司 | A kind of hotel industry addressing evaluation system and evaluation method based on geographical location |
CN110866000A (en) * | 2019-11-20 | 2020-03-06 | 珠海格力电器股份有限公司 | Data quality evaluation method and device, electronic equipment and storage medium |
CN110995863A (en) * | 2019-12-19 | 2020-04-10 | 上海交通大学 | Data center load distribution method and system based on load demand characteristics |
CN111144701A (en) * | 2019-12-04 | 2020-05-12 | 中国电子科技集团公司第三十研究所 | ETL job scheduling resource classification evaluation method under distributed environment |
Non-Patent Citations (4)
Title |
---|
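李庆阳 et al.: "Design and Implementation of a Data-Quality-Oriented ETL Framework", Computer Engineering and Design * |
李磊: "ETL Task Cluster Scheduling Method", Computer Technology and Development * |
李铭洋 et al.: "Service Quality Evaluation Method Based on Customers' Online Review Information", Journal of Liaoning University (Philosophy and Social Sciences Edition) * |
蒋一翔 et al.: "Target-Dependent Product Review Analysis Using Recursive Neural Networks", Computer Engineering and Design * |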
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210115 |