CN108268489B

CN108268489B - Method and device for evaluating data platform

Info

Publication number: CN108268489B
Application number: CN201611259558.7A
Authority: CN
Inventors: 樊炼; 林洁; 薛超; 曾磊; 王卉; 郭慈; 徐庆; 张欣
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Hubei Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Hubei Co Ltd
Priority date: 2016-12-30
Filing date: 2016-12-30
Publication date: 2020-12-01
Anticipated expiration: 2036-12-30
Also published as: CN108268489A

Abstract

The invention discloses a method and a device for evaluating a data platform. The method comprises the following steps: analyzing Structured Query Language (SQL) statements related to data entities in a data platform to obtain redundant data; analyzing an evaluation item including the redundant data according to an Epanechnikov kernel function; and evaluating the data platform according to the analyzed evaluation item. The embodiment of the invention also discloses a device for evaluating the data platform. The data platform can be evaluated in real time according to the evaluation item including the redundant data, which makes it convenient to adjust the related settings of the data platform in time and ensures the working efficiency of the data platform.

Description

Method and device for evaluating data platform

Technical Field

The invention relates to the field of computers, in particular to a method and a device for evaluating a data platform.

Background

With the rapid development of applications such as the mobile internet and the internet of things, the global data volume has grown explosively, signalling that the big data era has arrived. Not only is the data size growing, but the large number of data types and the high real-time requirements for processing also greatly increase the complexity of processing big data.

Signaling data in the communication field has an extremely large volume, and the real-time requirements of analysis services keep rising, so evaluating the health of the big data platform of a signaling analysis system is particularly important.

In the prior art, action is taken only when system resources or processing raise alarms or faults, and the data platform cannot be analyzed on a routine basis.

Disclosure of Invention

The embodiment of the invention provides a method for evaluating a data platform, which can evaluate the data platform in real time according to an evaluation item comprising redundant data, is convenient for adjusting the related settings of the data platform in time, and ensures the working efficiency of the data platform.

The embodiment of the invention also provides a device for evaluating the data platform, which can evaluate the data platform in real time according to the evaluation items of the redundant data, is convenient for adjusting the relevant settings of the data platform in time, and ensures the working efficiency of the data platform.

A method of evaluating a data platform, the method comprising:

analyzing a Structured Query Language (SQL) statement related to a data entity in a data platform to obtain redundant data;

analyzing an evaluation item including the redundant data according to an Epanechnikov kernel function;

evaluating the data platform according to the analyzed evaluation item.

Optionally, analyzing the SQL statements related to data entities in the data platform to obtain redundant data includes:

analyzing the SQL statements related to the data entities in the data platform by using an edit distance algorithm to obtain the redundant data.

Optionally, analyzing, by using an edit distance algorithm, the SQL statements related to the data entities in the data platform to obtain the redundant data includes:

parsing the SQL statements to obtain a data processing path and a data source of each model table;

concatenating, as character strings, the data structure corresponding to the data source and the data processing path to form a processing feature string of the model table;

and comparing the processing feature strings of different model tables pairwise by using the edit distance algorithm to obtain the redundant data.

Optionally, analyzing the evaluation item including the redundant data according to the Epanechnikov kernel function includes:

obtaining a bandwidth parameter by minimizing the mean square error over historical redundant data;

analyzing the evaluation item according to the bandwidth parameter, the redundant data and the Epanechnikov kernel function.

Optionally, the evaluation item further includes:

one or more categories of space usage data, system load data, storage specification data, degree of standardization data, data usage data, or heat assessment data;

analyzing the evaluation item including the redundant data according to the Epanechnikov kernel function includes:

for different categories, obtaining the bandwidth parameter corresponding to each category by minimizing the mean square error over historical category data;

analyzing the evaluation item according to the bandwidth parameter corresponding to the category, the category data and the Epanechnikov kernel function;

evaluating the data platform according to the analyzed evaluation item includes:

evaluating the data platform according to the analyzed evaluation items corresponding to the categories and the weights corresponding to the categories.

An apparatus to evaluate a data platform, the apparatus comprising:

a parsing module for obtaining redundant data from Structured Query Language (SQL) statements related to data entities in the data platform;

an analysis module for analyzing an evaluation item including the redundant data according to an Epanechnikov kernel function;

and the evaluation module is used for evaluating the data platform according to the evaluation items after analysis.

Optionally, the parsing module is further configured to parse, by using an edit distance algorithm, SQL statements related to data entities in the data platform to obtain redundant data.

Optionally, the parsing module is further configured to parse the SQL statement, and obtain a data processing path and a data source of each model table; combining and splicing the data structure corresponding to the data source and the processing path in a character mode to form a processing characteristic character string of the model; and comparing the processing characteristic character strings of different models pairwise by using an edit distance algorithm to obtain redundant data.

Optionally, the analysis module is further configured to obtain a bandwidth parameter by minimizing the mean square error over historical redundant data, and to analyze the evaluation item according to the bandwidth parameter, the redundant data and the Epanechnikov kernel function.

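Optionally, the evaluation item further includes one or more categories of space usage data, system load data, storage specification data, degree of standardization data, data usage data, or heat assessment data;

the analysis module is further configured to obtain, for different categories, the bandwidth parameter corresponding to each category by minimizing the mean square error over historical category data, and to analyze the evaluation item according to the bandwidth parameter corresponding to the category, the category data and the Epanechnikov kernel function;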

and the evaluation module is also used for evaluating the data platform according to the analyzed evaluation items corresponding to the categories and the weights corresponding to the categories.

According to the above technical solution, in the embodiment of the invention, SQL statements related to data entities in a data platform are first analyzed to obtain redundant data; an evaluation item including the redundant data is then analyzed according to an Epanechnikov kernel function; finally, the data platform is evaluated according to the analyzed evaluation item. The data platform can thus be evaluated in real time according to the evaluation item including the redundant data, so that the related settings of the data platform can subsequently be adjusted in time and the working efficiency of the data platform is ensured.

Drawings

The present invention will be better understood from the following description of specific embodiments thereof taken in conjunction with the accompanying drawings, in which like or similar reference characters designate like or similar features.

FIG. 1 is a schematic flow chart illustrating a method for evaluating a data platform according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating a process of analyzing SQL statements related to data entities in a data platform to obtain redundant data according to an embodiment of the present invention;

FIG. 3 is a flow chart illustrating an embodiment of analyzing an evaluation item including redundant data;

FIG. 4 is a schematic diagram of an apparatus for evaluating a data platform according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.

Not every contingency is fully considered when a data platform is built, so redundant data exists in the data platform, and this unnecessary redundant data lowers the working efficiency of the data platform. In the embodiment of the invention, SQL statements related to data entities in the data platform are analyzed to obtain redundant data; an evaluation item including the redundant data is analyzed according to an Epanechnikov kernel function; and the data platform is finally evaluated. Because the data platform can be evaluated in real time according to the evaluation item including the redundant data, the related settings of the data platform can subsequently be adjusted in time to reduce the generation of redundant data, which in turn ensures the working efficiency of the data platform.

Referring to fig. 1, a schematic flow chart of a method for evaluating a data platform specifically includes the following steps:

101. Analyze SQL statements related to data entities in the data platform to obtain redundant data.

SQL is a database query and programming language used to access data and to query, update, and manage relational database systems. By analyzing the task log, the SQL statements related to each data entity in the data platform are obtained, and the redundant data is then derived from them.

Referring to fig. 2, obtaining redundant data by analyzing SQL statements related to data entities in a data platform specifically includes:

1011. Parse the SQL statements to obtain a data processing path and a data source of each model table.

A model table is an abstraction and summary of entity tables in the database: entity tables that have the same table structure but correspond to different points in time can be abstracted into one model table. The model table is taken as the object of the analysis, which avoids repetition and redundancy in the analysis results. The SQL statements related to the data entities are parsed to obtain the data processing path and the data source of each model table. A data processing path is the logical path followed during data processing.
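The abstraction of time-partitioned entity tables into one model table can be illustrated with a minimal sketch. It assumes a hypothetical naming convention in which entity tables carry a trailing `_YYYYMM` or `_YYYYMMDD` suffix; the patent does not state how entity tables are named, so the pattern and the function names `model_table_name` and `group_by_model` are illustrative only.

```python
import re

# Assumed convention (not from the patent): entity tables end with a
# date suffix such as _201612 (month) or _20161230 (day).
_DATE_SUFFIX = re.compile(r"_(\d{8}|\d{6})$")


def model_table_name(entity_table: str) -> str:
    """Map an entity table name to its model table by dropping the date suffix."""
    return _DATE_SUFFIX.sub("", entity_table.upper())


def group_by_model(entity_tables):
    """Group entity table names under the model table they belong to."""
    groups = {}
    for name in entity_tables:
        groups.setdefault(model_table_name(name), []).append(name)
    return groups


tables = ["tb_signal_201611", "tb_signal_201612", "tb_user"]
print(group_by_model(tables))
```

Analysis then iterates over the keys of the grouping rather than over every entity table, which is what keeps the results free of per-period duplicates.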

1012. Concatenate, as character strings, the data structure corresponding to the data source and the data processing path to form a processing feature string of the model table.

The data source model is analyzed to obtain its data structure; the data structure of the data source model and the data processing path of the model table are then concatenated as character strings to form the processing feature string of each model table.

For example, the feature string of the model TABLE1 is [table structure information] + [processing information]: (COL1|COL2|COL3)(TIME_ID=201612), where the data processing path is the part corresponding to TIME_ID.
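The construction of a feature string can be sketched as follows. The function name `feature_string`, the upper-casing, the whitespace removal and the sorting of conditions are all assumptions made for illustration; the patent only states that structure and processing information are concatenated as characters.

```python
def feature_string(columns, conditions):
    """Build '[table structure]+[processing information]' as one string.

    columns:    column names of the data source, in table order.
    conditions: processing conditions extracted from the SQL statement.
    Normalization (upper case, no whitespace, sorted conditions) is an
    assumed preprocessing step so that equivalent SQL yields equal strings.
    """
    structure = "(" + "|".join(c.strip().upper() for c in columns) + ")"
    normalized = sorted("".join(c.split()).upper() for c in conditions)
    process = "".join(f"({c})" for c in normalized)
    return structure + process


print(feature_string(["col1", "col2", "col3"], ["time_id = 201612"]))
# prints (COL1|COL2|COL3)(TIME_ID=201612)
```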

1013. Compare the processing feature strings of different model tables pairwise by using an edit distance algorithm to obtain redundant data.

A string similarity algorithm determines whether two character strings are similar. Examples include the Jaro-Winkler distance algorithm, the longest common substring algorithm (LCS) and the GST algorithm.

Any of the above string similarity algorithms can be used in the present invention, but the choice must take into account both the data characteristics of the telecommunication service and the performance of character comparison. In terms of data characteristics, the string describing the processing of a data table is composed of SQL syntax and is an ordered string, so the string matching should be order-aware. Both the edit distance algorithm (Jaro-Winkler distance) and the GST algorithm satisfy this, since both take changes in character order between the two strings into account. However, because the GST algorithm has a high time complexity of O(), in actual operation it basically cannot meet the performance requirement of processing ten thousand tables within 3 hours. Moreover, the character-order problem can be handled well by ordering the strings uniformly during character preprocessing, without having to solve it inside the algorithm. The invention therefore adopts the edit distance algorithm, which is explained by example below:

Given two character strings $S_1$ and $S_2$, their distance is:

$$d_j = \frac{1}{3}\left(\frac{m}{|S_1|} + \frac{m}{|S_2|} + \frac{m - t}{m}\right) \quad (1)$$

where $m$ is the number of matched characters and $t$ is the number of transpositions.

Two characters, one from $S_1$ and one from $S_2$, are considered to match if they are identical and the distance between their positions does not exceed

$$\left\lfloor \frac{\max(|S_1|, |S_2|)}{2} \right\rfloor - 1$$

The matched characters determine the number of transpositions $t$: in short, half of the number of matched characters that appear in a different order is the number of transpositions $t$.

For example, all characters of MARTHA and MARHTA match, but T and H must be transposed to change MARTHA into MARHTA. T and H are therefore matched characters in a different order, and $t = 2/2 = 1$.

The distance between the two strings is then:

$$d_j = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6}\right) = 0.944$$

Jaro-Winkler gives a higher score to strings whose initial parts are the same. It defines a prefix scaling constant $p$; given two strings whose common prefix has length $l$, the Jaro-Winkler distance is:

$$d_w = d_j + [l\,p\,(1 - d_j)] \quad (2)$$

where $d_j$ is the distance of the two strings from formula (1); $l$ is the length of the common prefix, capped at 4; and $p$ is a constant that scales the prefix bonus. $p$ must not exceed 0.25, otherwise $d_w$ could exceed 1; here this constant is set to 0.1.

Thus, the Jaro-Winkler distance of MARTHA and MARHTA mentioned above is:

$$d_w = 0.944 + [3 \times 0.1 \times (1 - 0.944)] = 0.961$$
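The MARTHA/MARHTA example can be reproduced with a short sketch of formulas (1) and (2). Note that the value the text calls a "distance" behaves as a similarity score between 0 and 1, where 1 means identical strings. The function names and the greedy left-to-right matching order are choices made for this illustration, not details given in the patent.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro score of formula (1): 1.0 for identical strings, 0.0 for no match."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they sit within this many positions of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    m = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Walk both match sequences in order; characters that differ are out of order.
    out_of_order = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                out_of_order += 1
            j += 1
    t = out_of_order / 2  # number of transpositions
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler score of formula (2) with prefix length capped at max_prefix."""
    d_j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return d_j + prefix * p * (1 - d_j)


print(round(jaro("MARTHA", "MARHTA"), 3))          # 0.944
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```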

According to practical experience, when the Jaro-Winkler distance between the feature strings of two different model tables is greater than 0.9, the processing and the features of the two model tables are considered similar, and the two model tables are redundant. That is, this pair counts as 1 redundancy.

Taking the day as the unit, the number of redundancies of each day is counted and divided by the number of all model tables of that day to obtain the daily redundancy.

Taking the month as the unit, the number of redundancies of each month is counted and divided by the number of all model tables of that month to obtain the monthly redundancy.

The data redundancy, i.e. the redundant data, equals 0.7 × daily redundancy + 0.3 × monthly redundancy. The redundant data thus combines the daily and the monthly perspective, which ensures both its coverage and its time span.
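The counting and weighting described above can be sketched as follows. The similarity function is passed in as a parameter; the exact-equality stand-in used in the demo is only a placeholder for the Jaro-Winkler comparison, and all function names and sample values are hypothetical.

```python
from itertools import combinations


def redundancy_count(feature_strings, similarity, threshold=0.9):
    """Count pairs of model tables whose feature strings score above the threshold."""
    return sum(
        1
        for a, b in combinations(feature_strings, 2)
        if similarity(a, b) > threshold
    )


def redundancy_ratio(count, table_count):
    """Redundancy count divided by the number of model tables in the period."""
    return count / table_count if table_count else 0.0


def data_redundancy(daily, monthly, w_day=0.7, w_month=0.3):
    """Combine daily and monthly redundancy with the weights 0.7 and 0.3."""
    return w_day * daily + w_month * monthly


def exact_match(a, b):
    """Placeholder similarity: 1.0 if equal, else 0.0 (stand-in for Jaro-Winkler)."""
    return 1.0 if a == b else 0.0


day_strings = ["(A|B)(T=1)", "(A|B)(T=1)", "(C)(T=1)"]  # made-up feature strings
daily = redundancy_ratio(redundancy_count(day_strings, exact_match), len(day_strings))
print(data_redundancy(daily, monthly=0.1))
```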

102. Analyze the evaluation item including the redundant data according to the Epanechnikov kernel function.

The future trend of the related data can be analyzed by using a kernel density estimation algorithm. That is, by analyzing the evaluation item including the redundant data according to the Epanechnikov kernel function, it is possible to know whether the evaluated result of the evaluation item will develop in a good or a bad direction in the future. The data platform is then evaluated according to this trend.

The kernel density estimation algorithm proposed by Rosenblatt and Parzen is among the most effective and most widely used non-parametric density estimation algorithms. It derives the data distribution characteristics from the training samples alone and can be used to estimate density functions of arbitrary shape. Univariate kernel density estimation is described below.

Let $x_1, x_2, x_3, \ldots, x_n$ be samples of a random variable whose density function is $f(x)$, $x \in R$.

Let

$$f_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \quad (3)$$

be the density estimate of the density function $f(x)$, where $K(\cdot)$ is the kernel function and $h$ is the bandwidth parameter.

For convenience, let $K_h(u) = K(u/h)/h$; formula (3) can then be expressed as:

$$f_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) \quad (4)$$

As can be seen from formula (3), the kernel density estimate of $f$ depends on the given sample set, on the choice of the kernel function $K$, and on the choice of the bandwidth parameter $h$.

The present invention selects the Epanechnikov kernel function as the kernel function for estimating $f(x)$.

Epanechnikov kernel function:

$$K(u) = \frac{3}{4}(1 - u^2), \quad |u| \le 1$$

$$K(u) = 0, \quad |u| > 1$$
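A minimal sketch of the estimator in formula (3) with the Epanechnikov kernel follows. The kernel is written in its standard form, 3/4·(1 − u²) on |u| ≤ 1; the function names and sample values are made up for illustration.

```python
def epanechnikov(u: float) -> float:
    """Standard Epanechnikov kernel: 3/4 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density(x: float, samples, h: float) -> float:
    """Density estimate of formula (3) at point x with bandwidth h."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    return sum(epanechnikov((x - xi) / h) for xi in samples) / (n * h)


samples = [0.10, 0.20, 0.40]  # made-up redundancy values
print(kernel_density(0.25, samples, h=0.3))
```

Because the kernel has compact support, only samples within distance h of x contribute to the estimate, which keeps each evaluation cheap when h is small.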

Referring to FIG. 3, a schematic flow chart of analyzing an evaluation item including redundant data, the analysis specifically includes:

1021. Obtain the bandwidth parameter by minimizing the mean square error over historical redundant data.

The bandwidth parameter can be obtained by minimizing the mean square error over the historical redundant data.

The bandwidth parameter h is selected as follows: the mean integrated squared error MISE(h) is used as the criterion for judging the quality of the density estimate.

Wherein:

AMISE(h) is referred to as the asymptotic mean integrated squared error. σ is the standard deviation, i.e. the square root of the mean of the squared deviations from the mean, which reflects the degree of dispersion of a data set. Where

To minimize AMISE(h), h must be set to some intermediate value, which prevents f_h(x) from having too large a bias (oversmoothing) or too large a variance (undersmoothing). Minimizing AMISE(h) with respect to h shows that the best choice exactly balances the orders of the variance term and the bias term in AMISE(h); the optimal bandwidth is:

where K(x) and f(x) are evaluated on the historical redundant data. That is, the bandwidth parameter is obtained by minimizing the mean square error over the historical redundant data.
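For orientation, the standard textbook forms of the quantities named above are given here. They are general kernel density estimation results and are offered as an assumption about what the formulas in this passage express, not as a reproduction of them; the symbols $R(\cdot)$ and $m_2(\cdot)$ are introduced for this illustration only.

```latex
% Mean integrated squared error of the estimate f_h
\mathrm{MISE}(h) = E\int \bigl(f_h(x) - f(x)\bigr)^2 \, dx

% Asymptotic form: variance term + squared-bias term
\mathrm{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f''),
\qquad R(g) = \int g(x)^2 \, dx, \quad m_2(K) = \int x^2 K(x)\, dx

% Setting d\,\mathrm{AMISE}/dh = 0 balances the two terms
h_{\mathrm{opt}} = \left[\frac{R(K)}{m_2(K)^2\, R(f'')\, n}\right]^{1/5}

% Constants for the Epanechnikov kernel K(u) = \tfrac{3}{4}(1 - u^2), |u| \le 1
R(K) = \frac{3}{5}, \qquad m_2(K) = \frac{1}{5}
```

The variance term shrinks as h grows while the squared-bias term grows with h to the fourth power, which is why the minimizing h lies at an intermediate value and scales as n to the power −1/5.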

1022. Analyze the evaluation item according to the bandwidth parameter, the redundant data and the Epanechnikov kernel function.

The bandwidth parameter calculated in step 1021, the redundant data obtained in step 101, and the Epanechnikov kernel function are substituted into formula (4) to analyze the evaluation item including the redundant data.
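Steps 1021 and 1022 can be sketched end to end under one strong assumption: the unknown density is treated as approximately normal, which turns the AMISE-optimal bandwidth for the Epanechnikov kernel into the closed form h = (40√π)^(1/5) · σ · n^(−1/5), about 2.345 · σ · n^(−1/5). The patent does not state which minimization procedure it uses, so this normal-reference rule, the function names and the sample history are illustrative only.

```python
import math
import statistics

# (40 * sqrt(pi)) ** (1/5), about 2.345: AMISE-optimal constant for the
# Epanechnikov kernel when the true density is ASSUMED to be normal.
EPANECHNIKOV_NORMAL_REFERENCE = (40.0 * math.sqrt(math.pi)) ** 0.2


def rule_of_thumb_bandwidth(history):
    """Step 1021 (sketch): bandwidth from historical values, normal-reference rule."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two historical values")
    sigma = statistics.stdev(history)
    if sigma == 0:
        raise ValueError("historical values are all identical")
    return EPANECHNIKOV_NORMAL_REFERENCE * sigma * n ** -0.2


def epanechnikov(u):
    """Standard Epanechnikov kernel."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def density(x, history, h):
    """Step 1022 (sketch): kernel density estimate at x with bandwidth h."""
    return sum(epanechnikov((x - xi) / h) for xi in history) / (len(history) * h)


history = [0.12, 0.15, 0.11, 0.18, 0.14, 0.16]  # made-up historical redundancy
h = rule_of_thumb_bandwidth(history)
today = 0.17
print(h, density(today, history, h))
```

A low density at today's value relative to the historical peak would suggest the platform is drifting away from its usual redundancy level; how that is mapped to "good" or "bad" is not specified in the source.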

103. Evaluate the data platform according to the analyzed evaluation item.

The data platform is predicted and evaluated according to the analyzed evaluation item; for example, whether, given the current redundant data, the data platform is trending in a good or a bad direction.

SQL statements related to data entities in the data platform are analyzed to obtain redundant data; an evaluation item including the redundant data is analyzed according to the Epanechnikov kernel function; and the data platform is finally evaluated. The data platform can be evaluated in real time according to the evaluation item including the redundant data; that is, the technical solution of the invention can be used to evaluate the development trend of the data platform. The related settings of the data platform can therefore be adjusted in time afterwards to reduce the generation of redundant data, which in turn ensures the working efficiency of the data platform.

Further, on the basis of the above embodiment, the evaluation item may, in addition to the redundant data, include one or more of the following categories: space usage data, system load data, storage specification data, degree of standardization data, data usage data, or heat assessment data.

For each category, the bandwidth parameter corresponding to that category is first obtained by minimizing the mean square error over the historical data of the category. Different categories thus correspond to different bandwidth parameters. For example, the space usage data corresponds to a first bandwidth parameter, and the storage specification data corresponds to a second bandwidth parameter.

The evaluation item is then analyzed according to the bandwidth parameter corresponding to each category, the category data and the Epanechnikov kernel function, yielding the analyzed evaluation item for each category. The data platform is evaluated according to the analyzed evaluation items corresponding to the categories and the weights corresponding to the categories, where different categories carry different weights.
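The weighted combination across categories can be sketched as a weighted average. The patent does not specify how an analyzed evaluation item is turned into a numeric score, nor the weight values, so the score scale, the weights and the function name `platform_score` below are hypothetical.

```python
def platform_score(category_scores, weights):
    """Weighted average of per-category scores.

    category_scores: category name -> score of the analyzed evaluation item.
    weights:         category name -> weight of that category.
    """
    missing = set(category_scores) - set(weights)
    if missing:
        raise KeyError(f"no weight for categories: {sorted(missing)}")
    total_weight = sum(weights[c] for c in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(category_scores[c] * weights[c] for c in category_scores)
    return weighted / total_weight


scores = {"redundancy": 0.8, "space_usage": 0.6, "system_load": 0.9}  # made up
weights = {"redundancy": 0.5, "space_usage": 0.3, "system_load": 0.2}  # made up
print(platform_score(scores, weights))
```

Normalizing by the total weight keeps the result on the same scale as the inputs even when only a subset of categories is available.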

FIG. 4 is a schematic structural diagram of an apparatus for evaluating a data platform, which corresponds to the method of the first embodiment. The apparatus specifically comprises: a parsing module 401, an analysis module 402 and an evaluation module 403.

The parsing module 401 is configured to obtain redundant data from Structured Query Language (SQL) statements related to data entities in the data platform.

The analysis module 402 is configured to analyze an evaluation item including the redundant data according to an Epanechnikov kernel function.

The evaluation module 403 is configured to evaluate the data platform according to the analyzed evaluation item.

Specifically, the parsing module 401 is further configured to parse, by using an edit distance algorithm, SQL statements related to data entities in the data platform to obtain redundant data.

Specifically, the parsing module 401 is further configured to parse the SQL statement to obtain a data processing path and a data source of each model table; combining and splicing the data structure corresponding to the data source and the processing path in a character mode to form a processing characteristic character string of the model; and comparing the processing characteristic character strings of different models pairwise by using an edit distance algorithm to obtain redundant data. The detailed process can be seen in step 101.

Specifically, the analysis module 402 is further configured to obtain a bandwidth parameter by minimizing the mean square error over historical redundant data, and to analyze the evaluation item according to the bandwidth parameter, the redundant data and the Epanechnikov kernel function.

In addition to the redundant data, the evaluation item further includes one or more categories of space usage data, system load data, storage specification data, degree of standardization data, data usage data, or heat assessment data.

Specifically, the analysis module 402 is further configured to obtain, for different categories, the bandwidth parameter corresponding to each category by minimizing the mean square error over historical category data, and to analyze the evaluation item according to the bandwidth parameter corresponding to the category, the category data and the Epanechnikov kernel function.

Specifically, the evaluation module 403 is further configured to evaluate the data platform according to the analyzed evaluation items corresponding to the categories and the weights corresponding to the categories.

The technical effect of the device for evaluating the data platform in the second embodiment is the same as that of the method in the first embodiment, and is not described herein again.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method of evaluating a data platform, the method comprising:

evaluating the data platform according to the analyzed evaluation items;

the data platform is based on signaling data; the analyzing SQL statements related to the data entities in the data platform to obtain redundant data comprises the following steps:

analyzing SQL sentences related to data entities in the data platform by using an edit distance algorithm to obtain redundant data; the method for analyzing SQL sentences related to data entities in the data platform by using the edit distance algorithm to obtain redundant data comprises the following steps:

2. The method of evaluating a data platform according to claim 1, wherein analyzing the evaluation item including redundant data according to an Epanechnikov kernel function comprises:

3. The method of evaluating a data platform of claim 1, wherein the evaluation term further comprises:

4. An apparatus for evaluating a data platform, the apparatus comprising:

the evaluation module is used for evaluating the data platform according to the evaluation items after analysis;

the data platform is based on signaling data; the parsing module is further configured to analyze SQL statements related to the data entities in the data platform by using an edit distance algorithm to obtain redundant data; the parsing module is further configured to parse the SQL statements and obtain a data processing path and a data source of each model table; to concatenate, as character strings, the data structure corresponding to the data source and the processing path to form a processing feature string of the model; and to compare the processing feature strings of different models pairwise by using the edit distance algorithm to obtain the redundant data.

5. The apparatus for evaluating a data platform of claim 4, wherein the analysis module is further configured to obtain a bandwidth parameter by minimizing the mean square error over historical redundant data, and to analyze the evaluation item according to the bandwidth parameter, the redundant data and the Epanechnikov kernel function.

6. The apparatus for evaluating a data platform of claim 4, wherein the evaluation term further comprises: