CN117931954A

CN117931954A - A database data interface management method based on flexible configuration

Info

Publication number: CN117931954A
Application number: CN202311496447.8A
Authority: CN
Inventors: 韩俊; 蔡超; 潘文婕; 张文嘉; 樊安洁; 陈皓菲; 王娜
Original assignee: Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd
Current assignee: Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2024-04-26

Abstract

The present invention discloses a database data interface management method based on flexible configuration, comprising: step 1) obtaining the data of the different attributes in the database; step 2) computing the representativeness and the benchmark degree of each attribute to identify the representative attribute and the benchmark attribute, then constructing an attribute bipartite graph in each iteration to obtain the final attribute combination; step 3) performing data sharding and clustering according to the final attribute combination to achieve accurate storage. Beneficial effect: by finding the best attribute combination, the method shards and clusters the data accurately and thereby improves query efficiency.

Description

A database data interface management method based on flexible configuration

Technical Field

The present invention relates to the technical field of database data processing, and in particular to a database data interface management method based on flexible configuration.

Background Art

In scenarios where simulation calculations are invoked and indicator data is displayed in applications, the supply of basic data is critically important. As simulation logic and indicator dimensions deepen and business demand grows, access conditions and query-result formats multiply, and the work of hand-coding interfaces becomes increasingly heavy. This not only increases the workload but also seriously degrades the customer experience and causes substantial loss of business. The traditional approach to developing data source interfaces is labor-intensive and hard to maintain, and it lacks unified mechanisms for management, monitoring, and rapid recovery from exceptions. A configurable approach is therefore used to reduce the workload of requirements analysis, design, development, testing, and deployment, to greatly improve interface development efficiency, to achieve low-code development, and to manage database interface resources in a unified manner.

Under large-scale concurrent queries, a database data interface management system based on flexible configuration may encounter performance bottlenecks. For example, if many queries involve the same large table, that table may become a bottleneck and slow queries down. The traditional remedy is data sharding and clustering, which divides the data in the database into several smaller subsets called "shards". Each shard can be queried and updated independently, which improves the system's parallel processing capability. In addition, clustering the data so that related records fall into the same shard further improves data locality and thus query efficiency. However, if the data of each attribute is clustered separately and a different set of shards is obtained for each attribute, a large amount of space redundancy results and query efficiency also drops.

Summary of the Invention

The present invention aims to overcome the deficiencies of the prior art described above and provides a database data interface management method based on flexible configuration, implemented by the following technical solutions:

The database data interface management method based on flexible configuration comprises:

Step 1) Obtain the data of the different attributes in the database;

Step 2) Compute the representativeness and the benchmark degree of each attribute to identify the representative attribute and the benchmark attribute; construct an attribute bipartite graph in each iteration to obtain the final attribute combination;

Step 3) Perform data sharding and clustering according to the final attribute combination to achieve accurate storage.

In a further design of the method, when the attribute combination is obtained in step 2), K-M (Kuhn-Munkres) matching is performed on the bipartite graph and the K-M matching result is taken as the attribute combination result. This is repeated iteratively: the attribute combination result of each iteration is recorded, and the final attribute combination is derived from these results.

In a further design of the method, in step 2) the representativeness α_j of the j-th attribute is obtained according to formula (1).

In formula (1), N_j denotes the number of data types (distinct data values) of the j-th attribute; max(N) denotes the maximum number of data types over all attributes; J_s′ denotes the number of attributes other than the j-th attribute; and cv_s(n_j) denotes the coefficient of variation of the data values of the s-th attribute (an attribute other than the j-th) when the j-th attribute takes its n_j-th data value;

In a further design of the method, in step 2) the representativeness of all attributes is linearly normalized, and the attribute with the largest representativeness is recorded as the representative attribute.

In a further design of the method, the benchmark degree in step 2) is determined from the overall distribution under the current attribute. Specifically: the truncated data for which the representative attribute is the same is obtained, and the truncated data under the other representative attributes is obtained in the same way; the intersection of the truncated data under all representative attributes is taken as the first truncated data; and within each truncated data set, the benchmark degree γ_j of the j-th attribute is calculated according to formula (2).

In formula (2), J_s′ denotes the number of attributes other than the j-th attribute; M denotes the number of truncated data sets; and R_m(s) denotes the Pearson correlation coefficient between the data of the s-th attribute (an attribute other than the j-th) in the m-th truncated data set and the data of the j-th attribute in the m-th truncated data set.

In a further design of the method, the first truncated data is obtained as follows: all data in the database is arranged in order, with each record's position number serving as its locator; the records that share the same representative attribute are cut out as a segment, which yields the truncated data under the current representative attribute.

In a further design of the method, the final attribute combination in step 2) is obtained as follows:

Based on the representative attribute and the benchmark attribute obtained, the attribute combination is determined by constructing an attribute bipartite graph. Each bipartite graph corresponds to one fixed value of the representative attribute. Each left node is a non-representative attribute and each right node is also a non-representative attribute; no edge connects a left node and a right node that stand for the same attribute. Once the edge weights between the nodes are computed, the attribute bipartite graph is obtained;

Matching is performed iteratively. In the first iteration, K-M matching is performed on the bipartite graph. After the attribute combination given by the matching result is obtained, the difference between the data clustered according to this attribute combination and the previously clustered data is calculated; if the difference is below a preset difference threshold, iteration stops. On the basis of the representative attribute, and using the benchmark attribute as the overall reference, the data distribution curves of the attributes represented by the left node and the right node are calculated for a fixed value of the representative attribute;

Take the d-th data value of the representative attribute. The left node is the v-th non-representative attribute node; its data distribution curve has the benchmark attribute on the abscissa and the v-th non-representative attribute on the ordinate, and is recorded as the first data distribution curve. The right node is the u-th non-representative attribute node; its data distribution curve has the benchmark attribute on the abscissa and the u-th non-representative attribute on the ordinate, and is recorded as the second data distribution curve. For each data point on a data distribution curve, the difference from the regular distribution curve of the corresponding attribute is computed, and the similarity of these differences is used as the edge weight.

In a further design of the method, the regular distribution curve is obtained as follows: its abscissa is the benchmark attribute and its ordinate is the v-th non-representative attribute; for each existing abscissa point the corresponding ordinate value is computed, that is, the ordinate value y_q corresponding to the q-th abscissa value is calculated according to formula (3).

In formula (3), H denotes the number of clusters obtained by DBSCAN clustering on the q-th benchmark attribute value; W_h denotes the number of data types in the h-th cluster; P_w denotes the frequency of occurrence of the w-th data type; G_w denotes the data value of the w-th data type; the remaining factor describes the distribution characteristic of the cluster that contains the w-th data type of the h-th cluster, expressed by the density of the data within that cluster. The ordinate value is obtained as a weighted average in which this factor serves as the weight of the data type;

From the ordinate values obtained for the abscissa values, a continuous distribution curve is obtained, and fitting this curve gives the corresponding regular distribution curve. The regular distribution curve of the v-th non-representative attribute node is recorded as the first regular distribution curve, and that of the u-th non-representative attribute node as the second regular distribution curve. For each abscissa, the ordinate of the data point on the first data distribution curve is subtracted from the ordinate of the first regular distribution curve at the same abscissa; the absolute value of the result is the left node's difference value at that abscissa, recorded as the first difference value. The right node's difference value at the same abscissa is obtained in the same way and recorded as the second difference value. The difference between 1 and the ratio of the first difference value to the second difference value at the same abscissa is computed, and its absolute value is the difference value of that abscissa. The mean of the difference values over all abscissas is recorded as the difference between the left node and the right node; this difference is normalized with an inverse proportional function, and the result is recorded as the edge weight between the two nodes. The same operations are repeated to obtain the edge weights between the other nodes. K-M matching is then performed on the bipartite graph, the two nodes joined by the largest edge weight in the matching result are recorded as the nodes to be combined, and these nodes are combined to form a regular distribution curve.

In a further design of the method, the condition for deciding whether to continue iterating is as follows: the normalized mutual information (NMI) between the first clustering result and the second clustering result is calculated; if it is greater than a set threshold, iteration stops and the final attribute combination is obtained. Here the first clustering result is the result of one DBSCAN clustering iteration, and the second clustering result is the result of the DBSCAN clustering iteration that immediately follows it.
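The stopping test above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the arithmetic-mean normalization of the mutual information, and the example threshold are assumptions, since the text names NMI but does not fix these details.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings of the same records.

    Assumption: MI is normalized by the arithmetic mean of the two entropies.
    """
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    denom = (entropy(labels_a) + entropy(labels_b)) / 2
    # Two single-cluster labelings are identical partitions.
    return 1.0 if denom == 0 else mutual_info / denom


def should_stop(previous_labels, current_labels, threshold):
    """Stop iterating once consecutive clusterings agree closely enough."""
    return nmi(previous_labels, current_labels) > threshold


# Hypothetical labels from two consecutive clustering iterations.
previous = [0, 0, 1, 1, 2, 2]
current = [1, 1, 0, 0, 2, 2]   # same partition, different label names
print(should_stop(previous, current, threshold=0.9))
```

Because NMI compares partitions rather than label names, relabeled but otherwise identical clusterings score as fully consistent.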

The advantages of the present invention are as follows:

The present invention performs accurate sharding and clustering by finding the best attribute combination, which improves query efficiency. The representative attribute is obtained from the way the data of each attribute varies, and the benchmark attribute is obtained from the overall distribution of the data; a bipartite graph is then built in each iteration. When the edge weights of the bipartite graph are computed, the benchmark attribute serves as the reference, so the similarity between nodes, and hence each edge weight, is obtained accurately. The best attribute combination is then obtained from the K-M matching result of the bipartite graph, and the database is sharded and clustered according to that combination. This yields accurate sharding results and improves the query efficiency of the management system.

Detailed Description

The technical solution of the present invention is described in detail below.

Step 1 of this embodiment: obtain the data of the different attributes in the database.

The data of the different attributes is obtained from the database. The management system is used mainly to manage electric power data, so this embodiment is described using power data. The attributes include user ID, user name, time, electricity usage category, electricity consumption, electricity fee, and so on, and each record contains a value for every attribute. For ease of calculation, the data of each attribute is linearly normalized.
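Linear normalization of one attribute can be sketched as follows. This is a minimal illustration that assumes "linear normalization" means min-max scaling to [0, 1]; the function name and the handling of a constant column are assumptions.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no spread, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical electricity consumption values for one attribute column.
print(min_max_normalize([10, 20, 30]))
```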

Step 2: compute the representativeness and the benchmark degree of each attribute to identify the representative attribute and the benchmark attribute. Construct an attribute bipartite graph in each iteration to obtain the final attribute combination.

It should be noted that in traditional data sharding, clustering the data further improves data locality and thus query efficiency. However, if the data of each attribute is clustered separately, a separate sharding result is obtained for each attribute. Because the attributes are related to one another, these shards overlap: some shards contain a large number of the same records, which causes space redundancy and reduces query efficiency. To cluster the data accurately, the present invention clusters on a combination of attributes instead of on a single attribute, which achieves accurate clustering while keeping queries efficient. When the attribute combination is determined, two facts matter. First, different attributes carry different amounts of information about a whole record. Second, attributes differ in how well their variation serves as a reference for the whole record: if an attribute varies independently of the other attributes, it can characterize the baseline distribution of the overall data. Using such attributes as the characterizing conditions of the attribute combination improves the accuracy of data clustering.

It should further be noted that the attribute combination is obtained with a bipartite graph: K-M (Kuhn-Munkres) matching is performed on the bipartite graph, and the matching result gives a suitable attribute combination. This is repeated iteratively: the attribute combination result of each iteration is recorded, and the final attribute combination is derived from these results.

Step 2-1) Compute the representativeness and the benchmark degree of each attribute to identify the representative attribute and the benchmark attribute.

Across all the data in the database, the way the data of an attribute varies reflects that attribute's representativeness and benchmark degree. The representativeness characterizes the information capacity of the attribute. The representativeness α_j of the j-th attribute is calculated as follows:

Here N_j denotes the number of data types of the j-th attribute (how many distinct data values appear); max(N) denotes the maximum number of data types over all attributes; J_s′ denotes the number of attributes other than the j-th attribute;

Here cv_s(n_j) denotes the coefficient of variation of the data values of the s-th attribute (an attribute other than the j-th) when the j-th attribute takes its n_j-th data value. This value describes how the other attributes are distributed when the j-th attribute is held at a single value; if the data of the other attributes is widely dispersed (large coefficient of variation), the representativeness is low.
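The two ingredients defined above, N_j and cv_s(n_j), can be computed as in the following sketch. It covers only these ingredients; how the patent's formula combines them is not reproduced here. The function names, the use of the population standard deviation, and the sample rows are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0.0 when the mean is zero)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def cv_by_value(rows, attr_j, attr_s):
    """For each distinct value of attr_j, the CV of attr_s among rows with that value."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_j]].append(row[attr_s])
    return {value: coefficient_of_variation(vals) for value, vals in groups.items()}


# Hypothetical normalized power records.
rows = [
    {"category": "residential", "energy": 0.2},
    {"category": "residential", "energy": 0.2},
    {"category": "industrial", "energy": 0.1},
    {"category": "industrial", "energy": 0.3},
]
cvs = cv_by_value(rows, "category", "energy")
n_j = len(cvs)  # number of distinct values of the "category" attribute
print(n_j, cvs)
```

A small coefficient of variation for every value of an attribute means that fixing that attribute pins down the other attribute tightly, which is the behavior the text associates with high representativeness.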

The representativeness of all attributes is linearly normalized, and the attribute with the largest representativeness is recorded as the representative attribute.

The benchmark degree of an attribute characterizes how well that attribute represents the baseline variation of the overall data. It is determined from the overall distribution under the attribute, so the first truncated data, in which the representative attribute is the same, must be obtained first, and the benchmark degree is determined within that data. The first truncated data is obtained as follows. All data in the database is arranged in order, with each record's position number serving as its locator. All representative attributes are obtained according to the steps above. Taking one representative attribute as an example, the records that share the same representative attribute are cut out as a segment (note that the records in a segment are not necessarily contiguous in the arrangement), which yields the truncated data under this representative attribute. The truncated data under the other representative attributes is obtained in the same way. The intersection of all these truncated data sets is the first truncated data. Within each truncated data set, the benchmark degree γ_j of the j-th attribute is calculated as follows:

Here J_s′ denotes the number of attributes other than the j-th attribute; M denotes the number of truncated data sets; and R_m(s) denotes the Pearson correlation coefficient between the data of the s-th attribute (an attribute other than the j-th) in the m-th truncated data set and the data of the j-th attribute in the m-th truncated data set. Within the same truncated data set, the smaller the Pearson correlation coefficient between the data of the j-th attribute and the data of the other attributes, the less a change in the j-th attribute affects the other attributes, and therefore the greater the benchmark degree of the j-th attribute.
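The truncation and the per-segment Pearson correlation R_m(s) can be sketched as follows. This is an illustration under stated assumptions: the function names and sample rows are hypothetical, the intersection step that yields the "first truncated data" is not modeled, and averaging the absolute correlations is only one plausible way to summarize them, not the patent's formula.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient (0.0 if either series is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def truncated_segments(rows, rep_attr):
    """Group record positions by the value of the representative attribute.

    Positions in one segment need not be contiguous.
    """
    segments = defaultdict(list)
    for position, row in enumerate(rows):
        segments[row[rep_attr]].append(position)
    return dict(segments)


def mean_abs_correlation(rows, segments, attr_j, attr_s):
    """Average |R_m(s)| over all segments m (an assumed summary statistic)."""
    rs = []
    for positions in segments.values():
        xs = [rows[p][attr_j] for p in positions]
        ys = [rows[p][attr_s] for p in positions]
        rs.append(abs(pearson(xs, ys)))
    return sum(rs) / len(rs)


# Hypothetical records, interleaved so that segments are non-contiguous.
rows = [
    {"category": "a", "time": 0.1, "energy": 0.2},
    {"category": "b", "time": 0.1, "energy": 0.6},
    {"category": "a", "time": 0.2, "energy": 0.4},
    {"category": "b", "time": 0.2, "energy": 0.4},
    {"category": "a", "time": 0.3, "energy": 0.6},
    {"category": "b", "time": 0.3, "energy": 0.2},
]
segments = truncated_segments(rows, "category")
print(segments, mean_abs_correlation(rows, segments, "time", "energy"))
```

Under the text's reasoning, an attribute whose correlations with the other attributes are small in every segment would receive a high benchmark degree.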

Similarly, the benchmark degrees of all attributes are linearly normalized, and the attribute with the largest benchmark degree is recorded as the benchmark attribute.

At this point, the representativeness and the benchmark degree of each attribute have been computed, and the representative attribute and the benchmark attribute have been identified.

Step 2-2) Construct an attribute bipartite graph in each iteration to obtain the final attribute combination.

The representative attribute and the benchmark attribute are obtained according to the steps above. Because the representative attribute is highly representative, it must be part of the attribute combination. An attribute bipartite graph is constructed to determine the rest of the combination. Each bipartite graph corresponds to one fixed value of the representative attribute, so there are multiple bipartite graphs. Each left node is a non-representative attribute and each right node is also a non-representative attribute; no edge connects a left node and a right node that stand for the same attribute. Once the edge weights between the nodes are computed, the attribute bipartite graph is obtained. To ensure an optimal attribute combination, matching is performed iteratively. In the first iteration, K-M matching is performed on the bipartite graph. After the attribute combination given by the matching result is obtained, the difference between the data clustered according to this attribute combination and the previously clustered data is calculated. The preset difference threshold is 0.45; if the difference is below this threshold, iteration stops. The purpose of the iteration is to decide whether more attributes need to be added to the combination, and the difference threshold can be set by the implementer according to the specific situation.
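One matching round on such a bipartite graph can be sketched as follows. K-M matching is the Kuhn-Munkres (Hungarian) algorithm for maximum-weight matching; the code below is a compact standard-library version of it, not the patent's implementation. The function names, the use of `None` for the forbidden same-attribute pairs, the `FORBIDDEN` cost, and the example weights are all assumptions, and the sketch expects at least two attributes with weights in [0, 1].

```python
FORBIDDEN = 1e6  # cost for same-attribute pairs, which must not be connected


def km_max_matching(weights):
    """Maximum-weight perfect matching on a square weight matrix.

    weights[i][j] is the edge weight between left node i and right node j;
    None marks a missing edge. Returns `match` with match[i] = j.
    """
    n = len(weights)
    # Negate the weights so that minimising cost maximises total weight.
    cost = [[FORBIDDEN if w is None else -w for w in row] for row in weights]
    inf = float("inf")
    u = [0.0] * (n + 1)   # potentials of left nodes (1-indexed)
    v = [0.0] * (n + 1)   # potentials of right nodes (1-indexed)
    p = [0] * (n + 1)     # p[j] = left node matched to right node j (0 = none)
    way = [0] * (n + 1)   # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:  # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [0] * n
    for j in range(1, n + 1):
        match[p[j] - 1] = j - 1
    return match


def best_pair(names, weights):
    """The matched pair with the largest edge weight (the nodes to be combined)."""
    match = km_max_matching(weights)
    i = max(range(len(names)), key=lambda k: weights[k][match[k]])
    return names[i], names[match[i]]


# Hypothetical non-representative attributes and symmetric edge weights.
names = ["time", "energy", "fee", "user"]
weights = [
    [None, 0.9, 0.1, 0.2],
    [0.9, None, 0.3, 0.1],
    [0.1, 0.3, None, 0.8],
    [0.2, 0.1, 0.8, None],
]
print(km_max_matching(weights), best_pair(names, weights))
```

For real workloads a library routine such as SciPy's `linear_sum_assignment` would be the more usual choice; the hand-written version is used here only to keep the sketch dependency-free.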

The edge weight characterizes the similarity between the left node and the right node. Because there are many different attributes, the similarity cannot be computed simply by comparing the clustering result for the left node's attribute with the clustering result for the right node's attribute: a clustering based on a single attribute shows up differently on different attributes, so a similarity computed that way is inaccurate. The present invention therefore starts from the representative attribute, uses the benchmark attribute as the overall reference, and computes the data distribution curves of the attributes represented by the left node and the right node for a fixed value of the representative attribute (the data distribution is expressed in a coordinate system).

Take the d-th data value of the representative attribute. The left node is the v-th non-representative attribute node; its data distribution curve has the benchmark attribute on the abscissa and the v-th non-representative attribute on the ordinate, and is recorded as the first data distribution curve. The right node is the u-th non-representative attribute node; its data distribution curve has the benchmark attribute on the abscissa and the u-th non-representative attribute on the ordinate, and is recorded as the second data distribution curve. Because the first and second data distribution curves are discrete, the present invention computes, for each data point on a data distribution curve, the difference from the regular distribution curve of the corresponding attribute, and then uses the similarity of these differences as the edge weight.
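Collecting the discrete points of one data distribution curve can be sketched as follows; the function name, the dictionary-based record layout, and the sample rows are assumptions made for illustration.

```python
def distribution_points(rows, rep_attr, rep_value, benchmark_attr, attr):
    """(benchmark value, attribute value) pairs for records whose
    representative attribute equals rep_value, sorted by abscissa."""
    points = [
        (row[benchmark_attr], row[attr])
        for row in rows
        if row[rep_attr] == rep_value
    ]
    return sorted(points)


# Hypothetical normalized records: "category" plays the representative
# attribute and "time" the benchmark attribute.
rows = [
    {"category": "res", "time": 0.2, "energy": 0.5, "fee": 0.4},
    {"category": "ind", "time": 0.1, "energy": 0.9, "fee": 0.8},
    {"category": "res", "time": 0.1, "energy": 0.3, "fee": 0.2},
]
first_curve = distribution_points(rows, "category", "res", "time", "energy")
second_curve = distribution_points(rows, "category", "res", "time", "fee")
print(first_curve, second_curve)
```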

The v-th non-representative attribute node is used as an example to explain how the regular distribution curve is obtained. The abscissa of the regular distribution curve is again the benchmark attribute, and the ordinate is the v-th non-representative attribute. For the abscissa, all data values of the benchmark attribute are obtained, and for each existing abscissa point the corresponding ordinate value is computed. The ordinate value y_q corresponding to the q-th abscissa value is calculated as follows:

Here H denotes the number of clusters obtained by DBSCAN clustering on the q-th benchmark attribute value; W_h denotes the number of data types in the h-th cluster; P_w denotes the frequency of occurrence of the w-th data type; G_w denotes the data value of the w-th data type;

The remaining term in the formula denotes the distribution characteristic of the cluster that contains the w-th data type of the h-th cluster, expressed by the density among the data in that cluster. The ordinate value is obtained as a weighted average in which each data type carries a weight: if the cluster containing the data is relatively scattered, the data type behaves more randomly, and its weight is correspondingly smaller.

From the ordinate values obtained for the abscissas, a relatively continuous distribution curve is obtained, and this curve is fitted to obtain the corresponding regular distribution curve. The regular distribution curve of the left node (the v-th non-representative attribute node) is recorded as the first regular distribution curve, and that of the right node (the u-th non-representative attribute node) is recorded as the second regular distribution curve.

Next, the difference between the first data distribution curve and the first regular distribution curve is calculated, as is the difference between the second data distribution curve and the second regular distribution curve. For each abscissa, the ordinate of the data point on the first data distribution curve is subtracted from the ordinate of the first regular distribution curve at the same abscissa; the absolute value of the result is the difference value of the left node at that abscissa, recorded as the first difference value (there are multiple first difference values, one per abscissa). The same operation on the right node gives its difference value at the same abscissa, recorded as the second difference value (again one per abscissa). For each abscissa, the ratio of the first difference value to the second difference value is taken and 1 is subtracted from it; the absolute value of the result is the difference value of that abscissa. The mean of these difference values over all abscissas is recorded as the difference between the left node and the right node. This difference is normalized by an inverse proportional function, and the result is recorded as the edge weight between the two nodes.
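The edge-weight computation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `edge_weight`, the dictionary representation of the curves, the small `eps` guard against division by zero, and the choice of 1/(1+x) as the inverse proportional normalization are all assumptions, since the text names the normalization but does not give its exact form.

```python
from statistics import mean


def edge_weight(data_left, rule_left, data_right, rule_right, eps=1e-9):
    """Edge weight between a left and a right attribute node.

    Each argument maps an abscissa (a benchmark attribute value) to an
    ordinate. `data_*` are the discrete data distribution curves and
    `rule_*` are the fitted regular distribution curves sampled at the
    same abscissas. `eps` is an assumed guard, not part of the source.
    """
    shared = sorted(
        set(data_left) & set(rule_left) & set(data_right) & set(rule_right)
    )
    if not shared:
        raise ValueError("the curves share no abscissa")

    per_abscissa = []
    for x in shared:
        first = abs(data_left[x] - rule_left[x])      # first difference value
        second = abs(data_right[x] - rule_right[x])   # second difference value
        per_abscissa.append(abs(first / (second + eps) - 1.0))

    difference = mean(per_abscissa)  # difference between the two nodes
    # Assumed inverse proportional normalization: larger difference,
    # smaller weight, with the result in (0, 1].
    return 1.0 / (1.0 + difference)
```

Under these assumptions, two nodes whose data points deviate from their regular curves by the same amount at every abscissa receive a weight close to 1, and the weight shrinks as the deviations diverge.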

The edge weights between the other nodes are obtained in the same way. K-M matching is performed on this bipartite graph, and the two nodes connected by the largest edge weight in the matching result are recorded as the nodes to be combined. The nodes to be combined are then combined (similar to a 2-tuple). Whether iteration should continue is decided as follows. The stopping condition is not evaluated in the first iteration. In the second iteration, the data are clustered with DBSCAN according to the attribute combination of the current iteration, which gives the second clustering result; the result from the previous iteration is the first clustering result. The difference between the two clustering results is measured by their NMI (Normalized Mutual Information) value. The threshold is set to 0.65; if the NMI exceeds this threshold, iteration stops and the current attribute combination is taken as the final attribute combination.
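The matching and stopping steps above can be sketched in plain Python. Several things here are assumptions made for illustration: the names `best_pair`, `nmi`, and `should_stop` are hypothetical; the exhaustive search over permutations is a stand-in for the K-M (Kuhn-Munkres) algorithm that yields the same maximum-weight matching only for small graphs; and NMI is normalized by the geometric mean of the two entropies, one of several common conventions, since the text does not say which is used.

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def best_pair(weights):
    """Return the (left, right) pair with the largest edge weight inside a
    maximum-weight perfect matching. weights[i][j] is the edge weight from
    left node i to right node j; i == j (same attribute) is not connected."""
    n = len(weights)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):  # brute-force stand-in for K-M
        if any(i == j for i, j in enumerate(perm)):
            continue
        total = sum(weights[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    if best_perm is None:
        raise ValueError("no valid matching exists")
    left = max(range(n), key=lambda i: weights[i][best_perm[i]])
    return left, best_perm[left]


def nmi(labels_a, labels_b):
    """Normalized mutual information of two clusterings of the same data."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * log(c * n / (ca[x] * cb[y])) for (x, y), c in joint.items())
    ha = -sum(c / n * log(c / n) for c in ca.values())
    hb = -sum(c / n * log(c / n) for c in cb.values())
    if ha == 0 or hb == 0:  # at least one clustering has a single cluster
        return 1.0 if ha == hb else 0.0
    return mi / sqrt(ha * hb)


def should_stop(previous_labels, current_labels, threshold=0.65):
    """Stop iterating once consecutive clusterings agree closely enough."""
    return nmi(previous_labels, current_labels) > threshold
```

In this sketch the cluster labels would come from DBSCAN runs on the previous and current attribute combinations; the 0.65 default mirrors the threshold stated in the text.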

Step 3: Perform data sharding and clustering according to the final attribute combination to achieve accurate storage.

The preceding steps yield the final attribute combination. The data values corresponding to this attribute combination are used as the condition for sharding and clustering the data. The data of each resulting shard cluster are then stored, and an index is built to facilitate queries.
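A simplified sketch of the storage step follows. It assumes, for illustration only, that records are dictionaries and that a shard is formed by exact equality on the values of the final attribute combination; the method described above clusters on those values, so exact grouping is a deliberate simplification. The names `build_shards` and `lookup` and the sample attribute names are hypothetical.

```python
from collections import defaultdict


def build_shards(records, attribute_combination):
    """Group records into shards keyed by their values on the final
    attribute combination. The returned dict doubles as the index:
    key tuple -> list of records stored in that shard."""
    shards = defaultdict(list)
    for record in records:
        key = tuple(record[name] for name in attribute_combination)
        shards[key].append(record)
    return dict(shards)


def lookup(shards, *key_values):
    """Fetch the shard for the given attribute values, or an empty list."""
    return shards.get(tuple(key_values), [])


# Hypothetical sample data; "region" and "voltage" are illustrative names.
records = [
    {"id": 1, "region": "A", "voltage": 110},
    {"id": 2, "region": "A", "voltage": 110},
    {"id": 3, "region": "B", "voltage": 220},
]
shards = build_shards(records, ("region", "voltage"))
```

A query that filters on the same attributes then touches a single shard instead of scanning every record, which is the query-efficiency benefit the method aims for.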

The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. A database data interface management method based on flexible configuration, characterized by comprising:

Step 1) Obtain data of different attributes in the database;

Step 2) Obtain the representativeness and benchmark degree of different attributes, obtain representative attributes and benchmark attributes; construct attribute bipartite graphs in different iterations to obtain the final attribute combination;

Step 3) Data is partitioned and clustered according to the final attribute combination to achieve accurate storage.

2. The database data interface management method based on flexible configuration according to claim 1, characterized in that, in the process of obtaining the attribute combination in step 2), K-M matching is performed on the bipartite graph, the corresponding K-M matching result is used as the attribute combination result, and the attribute combination results of the individual iterations are combined to obtain the final attribute combination.

3. The database data interface management method based on flexible configuration according to claim 1, characterized in that in step 2), the representative degree α_j of the j-th attribute is obtained according to formula (1),

In formula (1), N_j denotes the number of data types of the j-th attribute; max(N) denotes the maximum number of data types over all attributes; J'_s denotes the number of attributes other than the j-th attribute; cv_s(n_j) denotes the coefficient of variation of the data values of the s-th attribute other than the j-th attribute when the j-th attribute takes its n_j-th data value.

4. The database data interface management method based on flexible configuration according to claim 3, characterized in that in step 2), the representativeness of all attributes is linearly normalized, and the attribute corresponding to the largest benchmark degree is selected and recorded as the benchmark attribute.

5. The database data interface management method based on flexible configuration according to claim 3, characterized in that the benchmark degree in step 2) is determined by the overall distribution under the current attribute, specifically: the truncated data are obtained for the data whose representative attribute values are the same, and the truncated data under the other representative attributes are obtained in the same way; among all the truncated data, the intersection of the truncated data under all representative attributes is obtained, and this intersection is the first truncated data; over the truncated data, the benchmark degree γ_j of the j-th attribute is calculated according to formula (2),

In formula (2), J'_s denotes the number of attributes other than the j-th attribute; M denotes the total number of truncated data; R_m(s) denotes the Pearson correlation coefficient between the data of the m-th truncated data under the s-th attribute other than the j-th attribute and the data of the m-th truncated data under the j-th attribute.

6. The database data interface management method based on flexible configuration according to claim 5, characterized in that the process of obtaining the first truncated data is: all the data in the database are arranged, the position number in this arrangement is used for positioning, the data with the same representative attribute values are segmented, and the truncated data under the current representative attribute are obtained.

7. The database data interface management method based on flexible configuration according to claim 5, characterized in that the process of obtaining the final attribute combination in step 2) is specifically:

Based on the representative attributes and benchmark attributes obtained, the attribute combination is determined by constructing an attribute bipartite graph, wherein the bipartite graph is built for data in which the representative attributes take the same value; each left node of the bipartite graph is a non-representative attribute and each right node is likewise a non-representative attribute; no edge connects a left node and a right node that represent the same attribute; the edge weights between the nodes are then obtained to give the attribute bipartite graph;

The matching is performed in an iterative manner. The iterative process is as follows: in the first iteration, the K-M matching of the bipartite graph is performed. After the attribute combination of the matching result is obtained, the difference between the data clustered according to the attribute combination and the data clustered before is calculated. If the difference is less than the preset difference threshold, the iteration is stopped. On the basis of the representative attributes obtained, the benchmark attributes are used as the overall benchmark to calculate the data distribution curves of the attributes represented by the left node and the right node when the representative attributes are the same.

Select the d-th data value of the representative attribute, whose left node is the v-th non-representative attribute node; the data distribution curve of the left node has the benchmark attribute on the horizontal axis and the v-th non-representative attribute on the vertical axis, and is recorded as the first data distribution curve; the right node is the u-th non-representative attribute node, whose data distribution curve has the benchmark attribute on the horizontal axis and the u-th non-representative attribute on the vertical axis, and is recorded as the second data distribution curve; the difference between each data point on a data distribution curve and the regular distribution curve of the corresponding attribute is computed, and the similarity of these differences is used as the edge weight.

8. The database data interface management method based on flexible configuration according to claim 7, characterized in that the regular distribution curve is obtained as follows: the horizontal axis of the regular distribution curve is set as the benchmark attribute and the vertical axis as the v-th non-representative attribute, and for each existing abscissa the corresponding ordinate value is computed, that is, the ordinate value y_q corresponding to the q-th abscissa value is calculated according to formula (3):

In formula (3), H denotes the number of clusters obtained by DBSCAN clustering based on the q-th benchmark attribute; W_h denotes the number of data types in the h-th cluster; P_w denotes the frequency of occurrence of the w-th data type; G_w denotes the data value of the w-th data type; the remaining term denotes the distribution characteristic of the cluster that contains the w-th data type of the h-th cluster, expressed by the density among the data in that cluster; the ordinate value is obtained as a weighted average in which each data type carries a weight;

According to the ordinate values obtained for the abscissas, a continuous distribution curve is obtained, and this curve is fitted to obtain the corresponding regular distribution curve; the regular distribution curve of the v-th non-representative attribute node is recorded as the first regular distribution curve, and that of the u-th non-representative attribute node as the second regular distribution curve; for each abscissa, the ordinate value of the data point on the first data distribution curve is subtracted from the ordinate value of the first regular distribution curve at the same abscissa, and the absolute value of the result is the difference value of the left node at this abscissa, recorded as the first difference value; the difference value of the right node at the same abscissa is obtained by analogy and recorded as the second difference value; the absolute value of the ratio of the first difference value to the second difference value minus 1 is taken as the difference value of the current abscissa; the mean of the difference values over multiple abscissas is recorded as the difference between the left node and the right node; this difference is normalized by an inverse proportional function, and the result is recorded as the edge weight between the two nodes; the above operation is repeated to obtain the edge weights between other nodes;

Perform K-M matching on the bipartite graph, obtain the two nodes corresponding to the largest edge weight in the matching result, record them as the nodes to be combined, and combine the nodes to be combined to form a regular distribution curve.

9. The database data interface management method based on flexible configuration according to claim 7, characterized in that the condition for judging whether to continue iteration is: the NMI (normalized mutual information) value of the first clustering result and the second clustering result is calculated, and if it is greater than the set threshold, the iteration is stopped and the final attribute combination is obtained; wherein the first clustering result is the DBSCAN clustering result of the previous iteration, and the second clustering result is the DBSCAN clustering result of the iteration that follows it.