
CN112231314A - Quality data evaluation method based on ETL

Info

Publication number: CN112231314A
Application number: CN202011225244.1A
Authority: CN
Inventors: 崔旖
Original assignee: Shenzhen Lihu Software Co ltd
Current assignee: Shenzhen Lihu Software Co ltd
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2021-01-15

Abstract

The invention provides an ETL-based quality data evaluation method. Batch ETL tasks are allocated to executors using a greedy scheduling algorithm and added to a Quartz scheduler for timed execution; the abstract Quartz task calls a remote executor interface, and when an executor receives the execution call for a new task, it starts a local thread to run the ETL task. Task priority is calculated using a high-response-ratio priority scheduling algorithm, and tasks allocated to an executor acquire thread resources to run according to that priority. ETL is used to predict and add service quality evaluation scale labels, which largely avoids the influence of subjective factors.

Description

Quality data evaluation method based on ETL

[ technical field ]

The invention relates to the technical field of quality evaluation, in particular to an ETL-based quality data evaluation method.

[ background of the invention ]

ETL (extract, transform, load) is a basic technology for building data warehouses and for batch data exchange. It is the process of extracting, transforming, integrating, cleansing, and loading data from a source to a target. The ETL process is an important part of constructing a data warehouse and can account for as much as 80% of the whole construction process. Over time, the number of ETL tasks in a data warehouse grows, and the amount of data grows with it. If the system runs on a single machine, task execution takes a long time, which severely limits the timeliness and availability of the data. Tasks are therefore allocated according to the data volume of each ETL task's data source, so that the load on the working nodes in the cluster is balanced and the total task execution time is as short as possible. Task priority is adjusted dynamically using a high-response-ratio priority scheduling algorithm, so that tasks are executed fairly on the nodes.

With the development of quality evaluation technology, quality evaluation of service information has been widely applied in many fields. For example, in medical imaging, image fusion helps doctors analyze a patient's condition and improves the diagnosis rate; fusion of infrared and visible-light images clearly conveys information under complex environmental conditions and can therefore be applied in the military field; multi-focus image fusion can effectively improve camera imaging quality, improve scene recognition accuracy, and eliminate redundant information among data. How to use ETL for effective objective quality evaluation, and how to combine that objective evaluation with subjective evaluation, has become an urgent problem to be solved.

[ summary of the invention ]

The invention provides a quality data evaluation method based on ETL.

The technical scheme adopted by the invention is as follows:

An ETL-based quality data evaluation method comprises: step 1, obtaining quality evaluation data and performing data preprocessing, and, for evaluation indexes of different dimensions, calculating a probability distribution function, a cumulative distribution function, and an expectation matrix of each service provider based on the obtained evaluation data; batch ETL tasks are allocated to executors using a greedy scheduling algorithm and added to a Quartz scheduler for timed execution, the abstract Quartz task calls a remote executor interface, and when an executor receives the execution call for a new task, it starts a local thread to run the ETL task; task priority is calculated using a high-response-ratio priority scheduling algorithm, and tasks allocated to an executor acquire thread resources to run the ETL task according to that priority;

the greedy allocation algorithm has the following specific steps:

step 101, sorting the task set S by data volume from largest to smallest and storing it in a queue A, A = { (si, wi) … (sj, wj) }, where wi ≥ wj and 1 ≤ i, j ≤ n;

step 102, taking the task at the head of queue A, calculating the total task data volume of each node, and allocating the task to ui, where 1 ≤ i ≤ m and ui is the node with the smallest total data volume among the current working nodes;

step 103, repeating step 102 until the queue is empty, at which point the algorithm ends;

step 2, obtaining a random dominance relationship matrix between every two service providers of the same type under each evaluation dimension by using a random dominance criterion;

step 3, determining a random dominance degree matrix through a dominance degree function;

step 4, aggregating the random dominance degree matrices under each evaluation dimension by using entropy weights;

step 5, calculating the ranking values of each evaluation dimension and of the overall service quality.

Further, acquiring quality evaluation data and performing data preprocessing, specifically comprising:

step 101, obtaining quality evaluation data from an evaluation database;

step 102, performing offline clustering on the obtained evaluation data according to structural similarity, analyzing and deleting irrelevant parts of the similarity structure, and inputting the evaluation quantitative value of the quality evaluation data, calculated with a subtractive clustering algorithm, into a recurrent neural network model to predict and add a service quality evaluation scale label.

Further, the evaluation quantitative value of the evaluation data is calculated using the following formula based on a subtractive clustering algorithm:

wherein

is the evaluation quantitative value of data point z_k of the evaluation data, calculated at time k.

Further, the calculating a probability distribution function, a cumulative distribution function, and an expectation matrix of the service provider specifically includes:

determining the number of users who assign service quality evaluation scale T^ε to service provider H_i under evaluation dimension S_j, i.e.

wherein

denotes the evaluation scale used by the x-th user when evaluating the quality of service provider H_i on evaluation dimension S_j, i∈I, j∈J, x∈{1,2,…,N_ij}, ε∈{1,2,…,v}, η∈{1,2,…,v};

then calculating the probability that the service quality evaluation scale of H_i under evaluation dimension S_j is T^ε,

i.e.

wherein

，

；

calculating by formula the cumulative distribution function F_ij(t) of service provider H_i with respect to evaluation scale T^ε on evaluation dimension S_j, i.e.

In addition, the expectation matrix of the evaluation scale of service provider H_i for evaluation dimension S_j is constructed as

wherein

。
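The counting, probability, cumulative distribution, and expectation steps above can be sketched as follows. This is a minimal illustration and not the patent's own formulas: it assumes the evaluation scales T^1 … T^v are coded as the integers 1 … v, that the probability is the share of users choosing each scale, and that the expectation is the probability-weighted scale value. The function name `scale_distribution` is hypothetical.

```python
from collections import Counter
from itertools import accumulate


def scale_distribution(ratings, v):
    """Empirical distribution of one provider's ratings on an ordinal scale 1..v.

    ratings: scale levels chosen by the users for one provider and one
    evaluation dimension (assumed to be integers in 1..v).
    """
    counts = Counter(ratings)          # number of users per scale level
    n = len(ratings)                   # total number of users (N_ij in the text)
    pmf = [counts.get(level, 0) / n for level in range(1, v + 1)]
    cdf = list(accumulate(pmf))        # running sum gives the cumulative distribution
    expectation = sum(level * p for level, p in zip(range(1, v + 1), pmf))
    return pmf, cdf, expectation


pmf, cdf, expectation = scale_distribution([1, 2, 2, 4], v=4)
print(pmf, cdf, expectation)
```

Running this once per provider and per evaluation dimension yields the inputs that the later dominance comparison works on.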

Further, a random dominance relationship matrix between every two service providers of the same type under each evaluation dimension is obtained by using a random dominance criterion, and the random dominance relationship matrix specifically comprises the following steps:

when the random dominance criterion is used to assess the service quality of service providers, F(x) and G(x) are the cumulative distribution functions of the evaluation scale of service providers H_i and H_h, respectively, for an evaluation dimension; when F(x) has random dominance over G(x), service provider H_i randomly dominates H_h in that evaluation dimension;

according to the above random dominance criterion, the random dominance relationship matrix between every two service providers is established for target evaluation dimension S_j

wherein

represents the random dominance relationship between service providers H_i and H_h for target evaluation dimension S_j, i.e.

wherein

indicates that, for target evaluation dimension S_j, service provider H_i has first-order random dominance over H_h,

indicates that, for target evaluation dimension S_j, service provider H_i has second-order random dominance over H_h,

indicates that, for target evaluation dimension S_j, service provider H_i has third-order random dominance over H_h, and

indicates that, for target evaluation dimension S_j, there is no random dominance relationship between service providers H_i and H_h.

Further, the random dominance criterion specifically includes:

assuming that X and Y are random variables on the interval [a, b] with cumulative distribution functions F(x) and G(x), respectively, the random dominance (stochastic dominance) criterion is expressed as follows:

first-order random dominance: if and only if F(x) ≠ G(x) and H1(x) = F(x) - G(x) ≤ 0 for all x ∈ [a, b], F(x) is said to have first-order random dominance over G(x), denoted F(x) FSD G(x);

second-order random dominance: if and only if F(x) ≠ G(x) and

，

x ∈ [a, b], F(x) is said to have second-order random dominance over G(x), denoted F(x) SSD G(x);

third-order random dominance: if and only if F(x) ≠ G(x) and

，

x ∈ [a, b], F(x) is said to have third-order random dominance over G(x), denoted F(x) TSD G(x).
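For reference, the three conditions are commonly written as below in the stochastic dominance literature. Only the first line is stated explicitly in the text; the symbols H_2 and H_3 and the integral forms are an assumption that extends the H_1 notation in the usual way, and some formulations of the third-order criterion add a condition on the means.

```latex
\begin{aligned}
\text{FSD:}\quad & H_1(x) = F(x) - G(x) \le 0, && \forall x \in [a, b],\\
\text{SSD:}\quad & H_2(x) = \int_a^x H_1(y)\,\mathrm{d}y \le 0, && \forall x \in [a, b],\\
\text{TSD:}\quad & H_3(x) = \int_a^x H_2(y)\,\mathrm{d}y \le 0, && \forall x \in [a, b].
\end{aligned}
```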

Further, the determining a random dominance degree matrix through a dominance degree function specifically includes:

based on the determined random dominance relationship matrix, a dominance degree function can be used to further describe the degree of random dominance between service providers H_i and H_h for evaluation dimension S_j;

based on the obtained random dominance relationship matrix, the random dominance degree matrix is determined through the dominance degree function as follows:

wherein a_j is the preference threshold for the evaluation dimension, which can be calculated from the differences between the service providers' expectations for that evaluation dimension; the calculation formula is

wherein a greater value of

indicates that, for evaluation dimension S_j, service provider H_i is superior to H_h to a higher degree, and a smaller value indicates a lower degree;

on this basis, the random dominance degree matrix between every two service providers is constructed for evaluation dimension S_j

, i,h∈I, i≠h, j∈J.

Further, the aggregating of the random dominance degree matrices under each evaluation dimension by using entropy weights specifically includes:

normalizing the expectation matrix

by the formula

and then determining the entropy weight of each evaluation dimension, wherein the information entropy is defined as:

where m is the number of evaluation targets.

Further, determining the entropy weight of the evaluation dimension specifically includes:

step 401, computing the output entropy, i.e.

Step 402, calculating the degree of difference of the evaluation dimensions, namely

Step 403, calculating the entropy weight of the evaluation dimension, i.e.

Finally, the dominance degree matrices under all evaluation dimensions are aggregated using the entropy weights to construct the overall dominance degree matrix of all service providers

, i,h∈I, i≠h, j∈J, wherein

is the value for service provider H_i relative to service provider H_h; the greater its value, the higher the degree it represents, and

is calculated by the formula

。
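The three entropy-weight steps and the final aggregation can be sketched as follows. This is a hedged illustration using the textbook entropy weight method rather than the patent's exact formulas: it assumes column-wise normalization to proportions, entropy scaled by 1/ln(m), degree of difference equal to one minus the entropy, and aggregation as a weighted sum. The function names are hypothetical.

```python
import math


def entropy_weights(matrix):
    """Entropy weight of each column (evaluation dimension).

    matrix: m rows (evaluation targets) by n columns (dimensions) of
    non-negative values, with m > 1 and at least one non-uniform column.
    """
    m = len(matrix)
    n_dims = len(matrix[0])
    diffs = []
    for j in range(n_dims):
        col = [row[j] for row in matrix]
        total = sum(col)
        probs = [x / total for x in col]               # normalization step
        # Step 401: output entropy, scaled by 1/ln(m) so it lies in [0, 1].
        entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(m)
        # Step 402: degree of difference of the dimension.
        diffs.append(1.0 - entropy)
    # Step 403: entropy weight is the normalized degree of difference.
    total_diff = sum(diffs)
    return [d / total_diff for d in diffs]


def aggregate(dominance_by_dim, weights):
    """Weighted sum of per-dimension dominance degree matrices (assumed form)."""
    size = len(dominance_by_dim[0])
    return [[sum(w * d[i][h] for w, d in zip(weights, dominance_by_dim))
             for h in range(size)] for i in range(size)]


weights = entropy_weights([[1, 1], [1, 3]])
print(weights)
```

A dimension on which all targets score the same carries no discriminating information, so its weight comes out at zero, which matches the intuition stated for the entropy weight.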

Further, the calculating of the ranking values of the evaluation dimensions and the overall service quality specifically includes:

calculating, based on the priority matrix, each service provider's

and

from which the ranking value is then obtained; according to the matrix

the following can be computed for each service provider with respect to evaluation dimension S_j:

and

the calculation formula is

wherein

represents the overall credibility that, for target evaluation dimension S_j, service provider H_i is superior to the other service providers under investigation;

the larger it is, the higher the level of service provider H_i on evaluation dimension S_j;

represents the overall credibility that, for target evaluation dimension S_j, service provider H_i is inferior to the other service providers under investigation;

the smaller it is, the higher the level of service provider H_i on evaluation dimension S_j;

according to

and

the ranking value of service provider H_i for evaluation dimension S_j is computed

The calculation formula is as follows:

the obtained ranking values can be used to rank service providers of the same type on evaluation dimension S_j; according to the matrix

the following are computed for service provider H_i:

and

the calculation formula is

wherein

represents the overall credibility that service provider H_i is superior to other service providers of the same type;

the larger it is, the higher the overall service quality of service provider H_i;

represents the overall credibility that service provider H_i is inferior to other service providers of the same type;

the smaller it is, the higher the overall service quality of service provider H_i;

according to

and

the overall ranking value of service provider H_i can be computed; the calculation formula is

The service quality of service providers of the same type can then be ranked overall according to the obtained ranking values.
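One construction consistent with the description above (a "superior to others" total where larger is better, an "inferior to others" total where smaller is better, and a ranking value derived from both) is the net-flow scheme used in outranking methods such as PROMETHEE II. The sketch below assumes that scheme; the row sum, column sum, and their difference are assumptions rather than the patent's stated formulas, and `rank_providers` is a hypothetical name.

```python
def rank_providers(dominance):
    """Rank providers from a square dominance degree matrix.

    dominance[i][h] is the degree to which provider i dominates provider h.
    Returns (scores, order), where order lists provider indices best first.
    """
    n = len(dominance)
    scores = []
    for i in range(n):
        # Assumed "superior" total: how strongly i dominates all others.
        superior = sum(dominance[i][h] for h in range(n) if h != i)
        # Assumed "inferior" total: how strongly all others dominate i.
        inferior = sum(dominance[h][i] for h in range(n) if h != i)
        scores.append(superior - inferior)   # assumed ranking value
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return scores, order


scores, order = rank_providers([
    [0.0, 0.6, 0.8],
    [0.1, 0.0, 0.5],
    [0.0, 0.2, 0.0],
])
print(scores, order)
```

The same function can be applied to a single dimension's dominance degree matrix for a per-dimension ranking, or to the aggregated matrix for the overall ranking.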

Through the embodiments of the invention, the following technical effect can be obtained: ETL is used to predict and add service quality evaluation scale labels, which largely avoids the influence of subjective factors.

[ description of the drawings ]

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the embodiments or the prior art descriptions will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without inventive labor.

FIG. 1 is a schematic diagram of an ETL-based cluster scheduling process according to the present invention;

FIG. 2 is a schematic flow chart of the greedy allocation algorithm of the present invention.

[ detailed description ]

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.

Fig. 1 is a schematic diagram of the ETL-based cluster scheduling process of the present invention. The ETL scheduling system consists of a scheduling module and an execution module. The scheduling module (scheduling center) manages scheduling information, abstracts all scheduled tasks into a single task type, and issues scheduling requests according to the scheduling configuration; it carries no business code. Decoupling the scheduling system from the specific tasks improves the availability and stability of the system. The execution module (executor) receives scheduling requests and executes the task logic.

The scheduling process is shown in Fig. 1. Batch ETL tasks are allocated to executors using a greedy scheduling algorithm and added to a Quartz scheduler for timed execution; the abstract Quartz task calls a remote executor interface, and when an executor receives the execution call for a new task, it starts a local thread to run the ETL task. Task priority is calculated using a high-response-ratio priority scheduling algorithm, and tasks allocated to an executor acquire thread resources to run according to that priority.

A greedy algorithm is an approximate method for finding an optimal solution. At each step it makes the choice that looks best at that moment. That is, it does not consider global optimality and instead produces a solution that is locally optimal in some sense.

Suppose there is a set of tasks S = {s1, s2, …, sn}, the source data volume corresponding to each task in S is W = {w1, w2, …, wn}, and there are m working nodes U = {u1, u2, …, um}. The algorithm distributes the tasks in S evenly across the working nodes U according to source data volume, seeking the allocation with the shortest total task execution time. It is assumed that the more evenly the total source data volume of the tasks assigned to each machine is distributed, the more balanced the hardware load and the smaller the total task execution time.

FIG. 2 is a schematic flow chart of the greedy allocation algorithm of the present invention. The greedy allocation algorithm has the following specific steps:

Step 101, sorting the task set S by data volume from largest to smallest and storing it in a queue A, A = { (si, wi) … (sj, wj) }, where wi ≥ wj and 1 ≤ i, j ≤ n.

Step 102, taking the task at the head of queue A, calculating the total task data volume of each node, and allocating the task to ui, where 1 ≤ i ≤ m and ui is the node with the smallest total data volume among the current working nodes.

Step 103, repeating step 102 until the queue is empty, at which point the algorithm ends.
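Steps 101 to 103 can be sketched in a few lines. This is a minimal illustration, assuming tasks are given as (task id, data volume) pairs and nodes are identified by index; the function name `greedy_allocate` and the use of a min-heap to find the least-loaded node are illustrative choices and not part of the original description.

```python
import heapq


def greedy_allocate(tasks, num_nodes):
    """Assign (task_id, data_volume) pairs to nodes, largest task first.

    Returns (assignment, totals): the task ids given to each node and the
    total data volume each node ends up with, both indexed by node.
    """
    # Step 101: sort tasks by data volume, largest first (queue A).
    queue = sorted(tasks, key=lambda t: t[1], reverse=True)

    # Min-heap of (current total volume, node index), so the least-loaded
    # node is always at the top.
    heap = [(0, node) for node in range(num_nodes)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_nodes)]
    totals = [0] * num_nodes

    # Steps 102 and 103: give the head of the queue to the node with the
    # smallest total so far, and repeat until the queue is empty.
    for task_id, volume in queue:
        total, node = heapq.heappop(heap)
        assignment[node].append(task_id)
        totals[node] = total + volume
        heapq.heappush(heap, (totals[node], node))

    return assignment, totals


tasks = [("s1", 7), ("s2", 3), ("s3", 9), ("s4", 4), ("s5", 2)]
assignment, totals = greedy_allocate(tasks, num_nodes=2)
print(assignment, totals)
```

With these five tasks and two nodes the totals come out at 12 and 13, close to the ideal even split of 12.5, which illustrates the load-balancing intent even though the greedy choice is not guaranteed to be optimal in general.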

High-response-ratio priority scheduling is used in the ETL scheduling system to dynamically adjust the priority of ETL tasks. ETL jobs differ in running time; dynamically adjusting their priority gives every task an opportunity to obtain resources and run.

The algorithm is applied in the executor module of the scheduling framework, whose main design is shown in Fig. 2. The executor continuously receives scheduling requests from the scheduling center and caches them in a priority queue; when thread group resources are sufficient, tasks in the priority queue are executed by threads directly. When thread group resources are insufficient, ETL tasks that have waited longer acquire higher priority, and when resources become idle the high-priority tasks get the chance to execute first. To prevent some tasks from occupying resources for so long that other tasks never run, the execution order is adjusted through dynamic task priority. A typical producer-consumer model is used: the priority queue is the buffer filled by the producer, a task monitoring thread acts as the consumer agent (responsible for task allocation), and the threads of the thread group are the consumers.
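The dynamic priority can be illustrated with the usual definition of the response ratio, (waiting time + estimated run time) / estimated run time. The sketch below assumes that definition and assumes each task carries an estimate of its run time; the class `EtlTask` and the functions `response_ratio` and `pick_next` are hypothetical names, not taken from the original.

```python
from dataclasses import dataclass


@dataclass
class EtlTask:
    name: str
    submitted_at: float   # seconds; when the executor received the request
    estimated_run: float  # seconds; assumed estimate of the task's run time


def response_ratio(task, now):
    """(waiting time + estimated run time) / estimated run time."""
    waiting = max(0.0, now - task.submitted_at)
    return (waiting + task.estimated_run) / task.estimated_run


def pick_next(pending, now):
    """Return the pending task with the highest response ratio at time now."""
    return max(pending, key=lambda t: response_ratio(t, now))


pending = [
    EtlTask("long_job", submitted_at=0.0, estimated_run=100.0),
    EtlTask("short_job", submitted_at=50.0, estimated_run=10.0),
]
print(pick_next(pending, now=60.0).name)
```

At time 60 the short job has a ratio of 2.0 against 1.6 for the long job, so it runs first; because the ratio of a waiting task keeps growing, the long job cannot be starved indefinitely. In the design described above, the task monitoring thread would re-evaluate the ratios each time a thread in the thread group becomes free.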

Step 1, obtaining quality evaluation data and carrying out data preprocessing, and calculating a probability distribution function, an accumulative distribution function and an expectation matrix of a service provider based on the obtained evaluation data aiming at evaluation indexes of different dimensions;

in the above step, the obtaining of the quality evaluation data and the data preprocessing specifically include:

step 101, obtaining quality evaluation data from an evaluation database. The obtaining mode can comprise directly reading from a database, or capturing corresponding evaluation data from evaluation databases of different network platforms through a web crawler;

Since the recurrent neural network model is constructed based on the similarity structure, the evaluation data is clustered using the similarity structure and certain keywords as features. Keywords that often appear in evaluation data and affect the various target analyses, such as "original", "may", "but", and "lingo", are preset, and the evaluation data is clustered using subtractive clustering. Each item of evaluation data can itself be regarded as a candidate cluster center, and the evaluation quantitative value of each item is calculated by the following formula:

wherein

is the potential value of data point z_i, N is the number of data points, and r_a is a positive constant. As the formula shows, the evaluation quantitative value of each item of evaluation data is a function of its distance to all other data points. The point with the highest evaluation quantitative value is selected as the first cluster center. The evaluation quantitative values of all data points are then reduced, by an amount that depends on their distance to that cluster center. The next cluster center is the data point with the largest remaining evaluation quantitative value, and so on. To cluster the evaluation data, the evaluation quantitative value is calculated with the following formula based on the subtractive clustering algorithm:

wherein

is the evaluation quantitative value of data point z_k of the evaluation data, calculated at time k.
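The select-and-reduce procedure described above can be sketched with the standard batch form of subtractive clustering. The exponential potential with factor 4 / r_a^2 and the reduction radius r_b = 1.5 r_a come from the commonly cited formulation and are assumptions here, since the formulas are not reproduced in the text; the time-indexed variant mentioned last is not covered. The function name and parameters are hypothetical.

```python
import math


def subtractive_clustering(points, r_a=1.0, n_centers=2):
    """Pick cluster centers by repeatedly taking the highest-potential point."""
    r_b = 1.5 * r_a                         # assumed radius for the reduction step
    alpha, beta = 4.0 / r_a ** 2, 4.0 / r_b ** 2

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Potential of each point: a function of its distance to all data points.
    potential = [sum(math.exp(-alpha * sq_dist(p, q)) for q in points)
                 for p in points]

    centers = []
    for _ in range(min(n_centers, len(points))):
        best = max(range(len(points)), key=potential.__getitem__)
        centers.append(points[best])
        p_best = potential[best]
        # Reduce every potential by an amount that depends on the distance
        # to the newly chosen center.
        potential = [p - p_best * math.exp(-beta * sq_dist(z, points[best]))
                     for p, z in zip(potential, points)]
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(subtractive_clustering(points, r_a=1.0, n_centers=2))
```

In this toy data set the first center falls in the denser group near the origin, and the reduction step then suppresses its neighbours so that the second center comes from the other group.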

The calculating of the probability distribution function, the cumulative distribution function and the expectation matrix of the service provider specifically includes:

wherein

denotes the evaluation scale used by the x-th user when evaluating the quality of service provider H_i on evaluation dimension S_j, i∈I, j∈J, x∈{1,2,…,N_ij}, ε∈{1,2,…,v}, η∈{1,2,…,v}.

i.e.

wherein

，

。

Wherein

second-order random dominance: if and only if F (x) ≠ G (x), and

，

third-order random dominance: if and only if F (x) ≠ G (x), and

，

x ∈ [a, b], F(x) is said to have third-order random dominance over G(x), denoted F(x) TSD G(x);

when the random dominance criterion is used to assess the service quality of service providers, F(x) and G(x) are the cumulative distribution functions of the evaluation scale of service providers H_i and H_h, respectively, for an evaluation dimension; when F(x) has random dominance over G(x), service provider H_i randomly dominates H_h in that evaluation dimension.
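Because the evaluation scale is discrete, the comparison of two providers' cumulative distribution functions can be sketched numerically. The following assumes that both CDFs are given on the same ordered scale points with unit spacing, that the integrals in the higher-order criteria are approximated by running sums, and that the lowest order at which dominance holds is the one recorded; `dominance_order` is a hypothetical helper.

```python
from itertools import accumulate


def dominance_order(F, G, tol=1e-12):
    """Return 1, 2 or 3 for the lowest order at which F dominates G, else 0.

    F and G are cumulative distribution values on the same ordered scale
    points. F dominates G at first order when F(x) - G(x) <= 0 everywhere.
    """
    if list(F) == list(G):
        return 0                                  # identical distributions
    h = [f - g for f, g in zip(F, G)]             # H1(x) = F(x) - G(x)
    for order in (1, 2, 3):
        if all(value <= tol for value in h):
            return order
        h = list(accumulate(h))                   # running sum approximates the integral
    return 0


print(dominance_order([0.1, 0.3, 1.0], [0.3, 0.6, 1.0]))
```

Calling the function for every ordered pair of providers on a given dimension fills in the random dominance relationship matrix for that dimension.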

According to the random dominance criterion, the random dominance relationship matrix between every two service providers can be established for evaluation dimension S_j

wherein

wherein

based on the determined random dominance relationship matrix, a dominance degree function can be used to further describe the degree of random dominance between service providers H_i and H_h for evaluation dimension S_j. Based on the obtained random dominance relationship matrix, the random dominance degree matrix is determined through the dominance degree function as follows:

wherein a greater value of

indicates that, for evaluation dimension S_j, service provider H_i is superior to H_h to a higher degree, and a smaller value indicates a lower degree.

On this basis, the random dominance degree matrix between every two service providers can be constructed for evaluation dimension S_j

, i,h∈I, i≠h, j∈J.

first, the expectation matrix

is normalized by the formula

Then the entropy weight of each evaluation dimension is determined. Information entropy is a measure of the degree of disorder of a system; it avoids the influence of subjective factors to a certain extent and addresses the problem of quantifying information. Information entropy is defined as:

where m is the number of evaluation targets. Entropy uses probability theory to measure the uncertainty of information: the more dispersed the data, the greater the uncertainty. In general, the more an index value varies, the smaller its information entropy, the more information the index provides, and the larger its weight; conversely, the smaller its weight. The information entropy can therefore be used to calculate the weight of each evaluation dimension, the "entropy weight", in three steps.

Step 401, computing the output entropy, i.e.

Step 403, calculating the entropy weight of the evaluation dimension, i.e.

，i,h∈I, i≠h, j∈J, wherein

is calculated by the formula

Step 5, calculating the ranking value of each evaluation dimension and the overall service quality

From the priority matrix, each service provider's

and

can be calculated, and from them the ranking value. According to the matrix

and

the calculation formula is

wherein

the smaller it is, the higher the level of service provider H_i on evaluation dimension S_j.

According to

and

the ranking value of service provider H_i for evaluation dimension S_j is computed

The calculation formula is as follows:

the obtained ranking values can be used to rank service providers of the same type on evaluation dimension S_j. According to the matrix

the following can be calculated for service provider H_i:

and

the calculation formula is

wherein

the smaller it is, the higher the overall service quality of service provider H_i.

According to

And

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A quality data evaluation method based on ETL is characterized in that,

step 1, obtaining quality evaluation data and performing data preprocessing, and, for evaluation indexes of different dimensions, calculating a probability distribution function, a cumulative distribution function, and an expectation matrix of each service provider based on the obtained evaluation data; batch ETL tasks are allocated to executors using a greedy scheduling algorithm, the tasks are added to a Quartz scheduler for timed execution, the abstract Quartz task calls a remote executor interface, and when an executor receives the execution call for a new task, a local thread is started to run the ETL task;

task priority is calculated using a high-response-ratio priority scheduling algorithm, and tasks allocated to an executor acquire thread resources to run the ETL task according to that priority;

the greedy allocation algorithm has the following specific steps:

step 101, sorting the task set S by data volume from largest to smallest and storing it in a queue A;

A = { (si, wi) … (sj, wj) }, where wi ≥ wj and 1 ≤ i, j ≤ n;

step 102, taking the task at the head of queue A, calculating the total task data volume of each node, and allocating the task to ui, where 1 ≤ i ≤ m;

ui is the node with the smallest total data volume among the current working nodes;

2. The quality evaluation method according to claim 1, wherein the acquiring of quality evaluation data and the data preprocessing specifically comprise:

step 101, obtaining quality evaluation data from an evaluation database;

3. The quality evaluation method according to claim 2, wherein the evaluation quantitative value of the evaluation data is calculated using the following formula based on a subtractive clustering algorithm:

wherein

4. The quality evaluation method according to one of claims 1 to 3, wherein the calculating a probability distribution function, a cumulative distribution function, and an expectation matrix of the service provider specifically comprises:

wherein

i.e.

wherein

，

；

Wherein

。

5. The quality evaluation method according to one of claims 1 to 4, wherein a random dominance criterion is used to obtain a random dominance relationship matrix between two service providers of the same type in each evaluation dimension, and the method specifically comprises:

when the random dominance criterion is used to assess the service quality of service providers, F(x) and G(x) are the cumulative distribution functions of the evaluation scale of service providers H_i and H_h, respectively, for an evaluation dimension; when F(x) has random dominance over G(x), service provider H_i randomly dominates H_h in that evaluation dimension;

Wherein

Wherein

6. The quality evaluation method according to one of claims 1 to 5, wherein the random dominance criterion specifically comprises:

second-order random dominance: if and only if F (x) ≠ G (x), and

，

third-order random dominance: if and only if F (x) ≠ G (x), and

，

7. The quality evaluation method according to one of claims 1 to 6, wherein the determining of the random dominance degree matrix by the dominance degree function specifically comprises:

wherein

,i,h∈I,i≠h,j∈J。

8. The quality evaluation method according to one of claims 1 to 7, wherein the aggregating of the random dominance degree matrices under each evaluation dimension by using entropy weights specifically comprises:

normalizing the expectation matrix

by the formula

where m is the number of evaluation targets.

9. The quality evaluation method according to claim 8, wherein determining the entropy weight of the evaluation dimension specifically comprises:

step 401, computing the output entropy, i.e.

Step 403, calculating the entropy weight of the evaluation dimension, i.e.

is calculated by the formula

。

10. The quality evaluation method according to one of claims 1 to 9, wherein the calculating of the ranking values of each evaluation dimension and the overall quality of service specifically comprises:

calculating a service provider based on the priority matrix