CN117931954A - Database data interface management method based on flexible configuration - Google Patents

Database data interface management method based on flexible configuration

Info

Publication number
CN117931954A
Authority
CN
China
Prior art keywords
attribute
data
representative
value
distribution curve
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311496447.8A
Other languages
Chinese (zh)
Inventor
韩俊
蔡超
潘文婕
张文嘉
樊安洁
陈皓菲
王娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202311496447.8A
Publication of CN117931954A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a database data interface management method based on flexible configuration, comprising the following steps: step 1) obtaining data of different attributes from a database; step 2) obtaining the representative degree and the reference degree of each attribute, and from them the representative attribute and the reference attribute; constructing an attribute bipartite graph in each iteration and obtaining a final attribute combination; and step 3) clustering the data into shards according to the final attribute combination, so as to achieve accurate storage. The beneficial effect is that shard clustering is performed on the optimal attribute combination, which improves query efficiency.

Description

Database data interface management method based on flexible configuration
Technical Field
The invention relates to the technical field of database data processing, in particular to a database data interface management method based on flexible configuration.
Background
In scenarios where simulation calculations are invoked and index data is displayed, the supply of basic data is extremely important. As simulation logic and index dimensions deepen, service demands grow, and query conditions and result formats multiply, hand-coding each interface becomes increasingly laborious, which seriously degrades the customer experience and causes considerable service loss. The traditional way of developing data source interfaces involves a heavy workload, is hard to maintain, and lacks a unified mechanism for management, monitoring, and rapid recovery from faults. To reduce the workload of requirements analysis, design, development, testing, and deployment, a configurable method is used to greatly improve interface development efficiency, enable low-code development, and manage database interface resources in a unified way.
Under large-scale concurrent queries, a database data interface management system based on flexible configuration may run into performance bottlenecks. For example, if many queries involve the same large table, that table may become a bottleneck and slow queries down. The conventional remedy is data shard clustering, which divides the data in the database into several smaller subsets called "shards". Each shard can be queried and updated independently, which improves the parallel processing capacity of the system. Clustering the data before assigning it to shards further improves data locality and therefore query efficiency. However, if the data is clustered separately on each attribute, a different set of shards is obtained for each attribute, which causes substantial space redundancy and reduces query efficiency.
Disclosure of Invention
The invention aims to overcome the defects in the prior art, and provides a database data interface management method based on flexible configuration, which is realized by the following technical scheme:
The database data interface management method based on flexible configuration comprises the following steps:
Step 1) obtaining data of different attributes from a database;
Step 2) obtaining the representative degree and the reference degree of each attribute, and from them the representative attribute and the reference attribute; constructing an attribute bipartite graph in each iteration and obtaining a final attribute combination;
Step 3) clustering the data into shards according to the final attribute combination, so as to achieve accurate storage.
The database data interface management method based on flexible configuration is further characterized in that, in step 2), the attribute combination is obtained by performing K-M matching on the bipartite graph and taking the K-M matching result as the attribute combination result; the attribute combination result of each iteration is obtained iteratively, and the final attribute combination is derived from these results.
The database data interface management method based on flexible configuration is further characterized in that step 2) obtains the representative degree α_j of the j-th attribute according to formula (1).
In formula (1), N_j represents the number of data types (distinct data values) of the j-th attribute; max(N) represents the maximum number of data types over all attributes; J_s' represents the number of attributes other than the j-th attribute; cv_s(n_j) represents the coefficient of variation of the data values of the s-th attribute (an attribute other than the j-th) when the j-th attribute takes its n_j-th data value;
The database data interface management method based on flexible configuration is further characterized in that, in step 2), the representative degrees of all attributes are linearly normalized, and the attribute with the maximum representative degree is selected and recorded as the representative attribute.
The database data interface management method based on flexible configuration is further characterized in that the reference degree in step 2) is determined by the overall distribution under the current attribute, specifically: truncated data is obtained for the records that share the same value of one representative attribute; truncated data under the other representative attributes is obtained in the same way; the intersection of the truncated data under all representative attributes is taken as the first truncated data; and, over the truncated data, the reference degree γ_j of the j-th attribute is calculated according to formula (2).
In formula (2), J_s represents the number of attributes other than the j-th attribute; M represents the number of truncated data segments; r_m(s) represents the Pearson correlation coefficient between the data of the s-th attribute (an attribute other than the j-th) and the data of the j-th attribute within the m-th truncated data segment.
The database data interface management method based on flexible configuration is further characterized in that the first truncated data is acquired as follows: all data in the database is arranged in order, with each record's position number used to locate it; the data sharing the same value of a representative attribute is cut into segments, which gives the truncated data under that representative attribute.
The database data interface management method based on flexible configuration is further characterized in that the final attribute combination in step 2) is obtained as follows:
An attribute combination is determined by constructing an attribute bipartite graph from the obtained representative attribute and reference attribute. One bipartite graph is built for each value of the representative attribute. The left nodes are the non-representative attributes and the right nodes are likewise the non-representative attributes; a left node is not connected to the right node that represents the same attribute. Once the edge weights between the nodes are obtained, the attribute bipartite graph is complete;
Matching is performed iteratively. In the first iteration, K-M matching is performed on the bipartite graph; after the attribute combination of the matching result is obtained, the difference between the data clustered according to this attribute combination and the previously clustered data is calculated, and iteration stops if the difference is smaller than a preset difference threshold. On the basis of the obtained representative attribute and with the reference attribute as the overall reference, the data distribution curves of the attributes represented by the left node and the right node are calculated for a fixed value of the representative attribute;
The d-th data value of the representative attribute is selected. The left node is the v-th non-representative attribute node; its data distribution curve has the reference attribute as abscissa and the v-th non-representative attribute as ordinate, and is recorded as the first data distribution curve. The right node is the u-th non-representative attribute node; its data distribution curve has the reference attribute as abscissa and the u-th non-representative attribute as ordinate, and is recorded as the second data distribution curve. The difference between each data point on a data distribution curve and the regular distribution curve of the corresponding attribute is taken, and the similarity of these differences is used as the edge weight of the edge.
The database data interface management method based on flexible configuration is further characterized in that the regular distribution curve is acquired as follows: the abscissa of the regular distribution curve is set to the reference attribute and the ordinate to the v-th non-representative attribute; for each existing abscissa point the corresponding ordinate value is determined, that is, the ordinate value y_q corresponding to the q-th abscissa value is calculated according to formula (3).
In formula (3), H represents the number of clusters obtained by DBSCAN clustering at the q-th value of the reference attribute; W_h represents the number of data types in the h-th cluster; p_w represents the frequency of occurrence of the w-th data type; g_w represents the data value of the w-th data type; the remaining term represents the distribution characteristic of the cluster containing the w-th data type in the h-th cluster, expressed by the density among the data in that cluster. The ordinate value is obtained as a weighted average in which this density term serves as the weight of the data type;
From the ordinate values obtained for the abscissas, a continuous distribution curve is obtained and fitted to give the corresponding regular distribution curve; the regular distribution curve of the v-th non-representative attribute node is recorded as the first regular distribution curve, and that of the u-th non-representative attribute node as the second regular distribution curve. For each abscissa of a data point on the first data distribution curve, the difference between the ordinate of the data point and the ordinate of the first regular distribution curve at the same abscissa is taken, and its absolute value is the difference value of the left node at that abscissa, recorded as the first difference value; the difference value of the right node at the same abscissa is obtained in the same way and recorded as the second difference value. The ratio of the first difference value to the second difference value at the same abscissa is reduced by 1, and the absolute value of the result is the difference value at that abscissa; the mean of the difference values over all abscissas is recorded as the difference between the left node and the right node. Inverse proportion function normalization is applied to the difference, and the result is recorded as the edge weight of the two nodes; the above operations are repeated to obtain the edge weights between the other nodes. K-M matching is performed on the bipartite graph, the two nodes corresponding to the maximum edge weight in the matching result are recorded as the nodes to be combined, and the nodes to be combined are merged into one attribute combination.
The database data interface management method based on flexible configuration is further characterized in that the condition for deciding whether to continue iterating is as follows: the NMI (normalized mutual information) of the first clustering result and the second clustering result is calculated; if it is larger than a set threshold, iteration stops and the final attribute combination is obtained. The first clustering result is the DBSCAN clustering result of the previous iteration, and the second clustering result is the DBSCAN clustering result of the current iteration.
The invention has the following advantages:
The invention performs accurate shard clustering by obtaining the optimal attribute combination, which improves query efficiency. The representative attribute is obtained from the data variation characteristics of each attribute, and the reference attribute is obtained from the overall distribution of the data, so that bipartite graphs can be built in the successive iterations. When the edge weights of a bipartite graph are calculated, the reference attribute is used to obtain an accurate similarity between nodes and hence accurate edge weights. The optimal attribute combination is then obtained from the K-M matching result of the bipartite graph, and the database is shard-clustered according to it, which gives accurate shards and improves the query efficiency of the management system.
Detailed Description
The following describes the technical scheme of the invention in detail.
Step 1 of the present embodiment: data of different attributes in the database is acquired.
The management system in this embodiment is mainly used to manage power data, so the embodiment is described using power data. The data of different attributes includes user ID, user name, time, power consumption type, electricity quantity, electricity charge, and so on; each record contains values for all of these attributes. To simplify calculation, the data of each attribute is linearly normalized.
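The per-attribute normalization mentioned above can be sketched as follows. The text only says "linear normalization"; min-max scaling to [0, 1] is assumed here, and the function name and sample readings are hypothetical.

```python
def min_max_normalize(values):
    """Linearly rescale one numeric attribute column to [0, 1] (assumed min-max form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every entry to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical electricity quantities (kWh) for four records.
energy_kwh = [120.0, 80.0, 200.0, 80.0]
normalized = min_max_normalize(energy_kwh)
```

Categorical attributes such as the power consumption type would first need a numeric encoding, which the text does not specify.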
Step 2: the representative degree and the reference degree of each attribute are obtained, and from them the representative attribute and the reference attribute. An attribute bipartite graph is constructed in each iteration, and the final attribute combination is obtained.
It should be noted that in conventional data sharding, clustering the data improves data locality and thus query efficiency. However, if the data is clustered separately on each attribute, a separate set of shards is obtained per attribute. Because the attributes are correlated, these shards overlap: some shards obtained under different attributes largely repeat one another, which causes space redundancy and lowers query efficiency. To cluster the data accurately, the invention replaces clustering on single attributes with clustering on multiple attributes through an attribute combination, which achieves accurate clustering while preserving query efficiency. Two facts are used when the attribute combination is obtained: different attributes carry different amounts of information about the data as a whole, and attributes differ in how well their variation serves as a reference for the data as a whole (that is, if the variation of an attribute is not affected by the other attributes, that attribute can represent the reference distribution of the whole data). Using such attributes to characterize the attribute combination improves the accuracy of data clustering.
It should be further noted that the attribute combination is obtained with a bipartite graph method: K-M matching is performed on the bipartite graph, and the K-M matching result is taken as a suitable attribute combination result. The attribute combination result of each iteration is obtained iteratively, and the final attribute combination is derived from these results.
Step 2-1) obtaining the representative degree and the reference degree of different attributes, and obtaining the representative attribute and the reference attribute.
Based on all the data in the database, the data variation characteristics of each attribute reveal its representative degree and reference degree. The representative degree characterizes how much information the attribute carries. The representative degree α_j of the j-th attribute is calculated by formula (1).
In formula (1), N_j represents the number of data types of the j-th attribute (how many distinct data values occur); max(N) represents the maximum number of data types over all attributes; J_s represents the number of attributes other than the j-th attribute;
cv_s(n_j) represents the coefficient of variation of the data values of the s-th attribute (an attribute other than the j-th) when the j-th attribute takes its n_j-th data value. It describes how the other attributes are distributed when the value of the j-th attribute is fixed: the more dispersed the other attributes are (the larger the coefficient of variation), the less representative the j-th attribute is.
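The ingredients of the representative degree can be sketched as below. Formula (1) itself is not reproduced in this text, so the way the terms are combined here, (N_j / max(N)) / (1 + mean conditional coefficient of variation), is an assumption chosen only to respect the stated direction (more dispersion in the other attributes means a lower representative degree). All function names and the sample rows are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(xs):
    """Population std divided by |mean|; 0.0 when it is undefined."""
    m = mean(xs)
    if len(xs) < 2 or m == 0:
        return 0.0
    return pstdev(xs) / abs(m)


def representative_degree(rows, j, n_max):
    """ASSUMED form: (N_j / n_max) / (1 + mean of cv_s(n_j))."""
    # Group records by the value taken by attribute j.
    groups = {}
    for row in rows:
        groups.setdefault(row[j], []).append(row)
    others = [s for s in range(len(rows[0])) if s != j]
    # cv_s(n_j): spread of every other attribute s within each group.
    cvs = [coefficient_of_variation([r[s] for r in grp])
           for grp in groups.values() for s in others]
    return (len(groups) / n_max) / (1.0 + mean(cvs))


# Hypothetical numeric records with three attributes.
rows = [(1, 10.0, 3.0), (1, 10.0, 3.0), (2, 20.0, 1.0), (2, 20.0, 9.0)]
n_max = max(len({r[k] for r in rows}) for k in range(3))
alpha = [representative_degree(rows, k, n_max) for k in range(3)]
```

In this toy data the third attribute scores highest, because fixing its value leaves the other two attributes constant.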
The representative degrees of all attributes are linearly normalized, and the attribute with the maximum representative degree is selected as the representative attribute.
The reference degree of an attribute characterizes how well that attribute serves as a reference for the data as a whole. It is determined by the overall distribution of the attribute, so the first truncated data under identical representative attribute values must be obtained first, and the reference degree of the attribute is then determined within that data. The first truncated data is obtained as follows. All data in the database is arranged in order, and each record's position number is used to locate it. With all representative attributes obtained as described above, take one representative attribute as an example: the data sharing the same value of that attribute is cut into segments (note that the data in a segment is not necessarily contiguous in the arrangement), which gives the truncated data under that representative attribute. Truncated data under the other representative attributes is obtained in the same way, and the intersections of all of these truncated data are the first truncated data. Over the truncated data, the reference degree γ_j of the j-th attribute is calculated by formula (2).
In formula (2), J_s represents the number of attributes other than the j-th attribute; M represents the number of truncated data segments; r_m(s) represents the Pearson correlation coefficient between the data of the s-th attribute (an attribute other than the j-th) and the data of the j-th attribute within the m-th truncated data segment. The smaller the Pearson correlation coefficient between the j-th attribute and the other attributes within the same truncated data segment, the less a change in the j-th attribute influences the other attributes, and the larger the reference degree of the j-th attribute.
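A minimal sketch of the reference degree follows. Formula (2) is not reproduced in this text, so the form used here, one minus the mean absolute Pearson coefficient over all segments and all other attributes, is an assumption that only preserves the stated direction (smaller correlation gives a larger reference degree). The function names and the sample segment are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return sxy / sqrt(sxx * syy)


def reference_degree(segments, j):
    """ASSUMED form: 1 - mean |r_m(s)| over segments m and attributes s != j."""
    rs = []
    for seg in segments:
        col_j = [row[j] for row in seg]
        for s in range(len(seg[0])):
            if s != j:
                rs.append(abs(pearson(col_j, [row[s] for row in seg])))
    return 1.0 - sum(rs) / len(rs)


# One hypothetical truncated segment with three attributes.
segments = [[(1.0, 2.0, 5.0), (2.0, 4.0, 5.0), (3.0, 6.0, 5.0)]]
gamma_0 = reference_degree(segments, 0)
gamma_2 = reference_degree(segments, 2)
```

Here the first attribute moves in lockstep with the second, so its reference degree is lower than that of the third attribute, which is unrelated to both.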
Similarly, the reference degrees of all attributes are linearly normalized, and the attribute with the maximum reference degree is selected as the reference attribute.
Thus, the representative degree and the reference degree of different attributes are obtained, and the representative attribute and the reference attribute are obtained.
Step 2-2) constructing an attribute bipartite graph in the process of different iterations, and obtaining a final attribute combination.
The representative attribute and the reference attribute are obtained as described above. Because the representative attribute is strongly representative, it is always included in the attribute combination. An attribute bipartite graph is constructed to determine the attribute combination. One bipartite graph is built for each value of the representative attribute (so there are several bipartite graphs). The left nodes are the non-representative attributes and the right nodes are likewise the non-representative attributes; a left node is not connected to the right node that represents the same attribute. Once the edge weights between the nodes are obtained, the attribute bipartite graph is complete. To ensure an optimal attribute combination, matching is performed iteratively. In the first iteration, K-M matching is performed on the bipartite graph. After the attribute combination of the matching result is obtained, the difference between the data clustered according to this attribute combination and the previously clustered data is calculated. The difference threshold is preset to 0.45; if the difference is smaller than the threshold, iteration stops. The purpose of iterating is to decide whether the number of attributes in the combination needs to be increased, and the difference threshold may be set by the implementer according to the specific implementation.
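The matching step can be illustrated with a small sketch. K-M (Kuhn-Munkres) matching finds a maximum-weight perfect matching in polynomial time; the code below is a brute-force stand-in that enumerates every assignment and returns the same optimum, which is only practical for a handful of attributes. The weight matrix is hypothetical, and `None` marks the forbidden edge between a left node and the right node of the same attribute.

```python
from itertools import permutations


def max_weight_matching(weights):
    """Brute-force maximum-weight perfect matching (stand-in for K-M).

    weights[v][u] is the edge weight between left node v and right node u,
    or None when the edge does not exist.
    """
    n = len(weights)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        if any(weights[v][perm[v]] is None for v in range(n)):
            continue  # this assignment uses a missing edge
        total = sum(weights[v][perm[v]] for v in range(n))
        if best_total is None or total > best_total:
            best_total, best_perm = total, perm
    if best_perm is None:
        return []
    return [(v, best_perm[v]) for v in range(n)]


# Hypothetical similarities between four non-representative attributes.
w = [
    [None, 0.9, 0.1, 0.1],
    [0.9, None, 0.1, 0.1],
    [0.1, 0.1, None, 0.8],
    [0.1, 0.1, 0.8, None],
]
matching = max_weight_matching(w)
# The matched edge with the largest weight names the nodes to be combined.
best_pair = max(matching, key=lambda e: w[e[0]][e[1]])
```

With these weights, attributes 0 and 1 are matched to each other and form the pair with the largest edge weight.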
The edge weight characterizes the similarity between the left node and the right node. Because there are many different attributes, the similarity cannot be computed simply by comparing the clustering result on the left node's attribute with the clustering result on the right node's attribute: single-attribute clustering results express different content for different attributes, so a similarity computed that way would be inaccurate. Therefore, on the basis of the obtained representative attribute and with the reference attribute as the overall reference, the invention calculates, for a fixed value of the representative attribute, the data distribution curves of the attributes represented by the left node and the right node (the data distribution is expressed in a coordinate system).
The d-th data value of the representative attribute is selected. The left node is the v-th non-representative attribute node; its data distribution curve has the reference attribute as abscissa and the v-th non-representative attribute as ordinate, and is recorded as the first data distribution curve. The right node is the u-th non-representative attribute node; its data distribution curve has the reference attribute as abscissa and the u-th non-representative attribute as ordinate, and is recorded as the second data distribution curve. Because the first and second data distribution curves are discrete, the invention takes the difference between each data point on a data distribution curve and the regular distribution curve of the corresponding attribute, and uses the similarity of these differences as the edge weight of the edge.
The process of acquiring a regular distribution curve is described taking the v-th non-representative attribute node as an example. The abscissa of the regular distribution curve is again the reference attribute, and the ordinate is the v-th non-representative attribute. For the abscissa, all data values of the reference attribute are acquired, and for each existing abscissa point the corresponding ordinate value is determined. The ordinate value y_q corresponding to the q-th abscissa value is calculated by formula (3).
In formula (3), H represents the number of clusters obtained by DBSCAN clustering at the q-th value of the reference attribute; W_h represents the number of data types in the h-th cluster; p_w represents the frequency of occurrence of the w-th data type; g_w represents the data value of the w-th data type;
the remaining term of formula (3) represents the distribution characteristic of the cluster containing the w-th data type in the h-th cluster, expressed by the density among the data in that cluster. The ordinate value is obtained as a weighted average in which this density term serves as the weight of the data type: if the cluster containing the data is scattered, the data type occurs more or less at random, and its weight is correspondingly smaller.
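The weighted average behind one ordinate value can be sketched as follows. Formula (3) is not reproduced in this text, so the exact weighting is an assumption: each data type is weighted by the density of its cluster multiplied by its frequency p_w. The DBSCAN step is assumed to have been run already, and the cluster densities, frequencies, and values below are hypothetical.

```python
def regular_ordinate(clusters):
    """ASSUMED weighted average for one abscissa value.

    clusters: list of (density, [(frequency, value), ...]) pairs, one per
    DBSCAN cluster found at this abscissa value.
    """
    numerator = denominator = 0.0
    for density, kinds in clusters:
        for frequency, value in kinds:
            weight = density * frequency  # scattered clusters count for less
            numerator += weight * value
            denominator += weight
    return numerator / denominator if denominator else 0.0


# Two hypothetical clusters: a dense one near 10-20 and a sparse outlier at 100.
clusters_at_q = [
    (2.0, [(0.5, 10.0), (0.25, 20.0)]),
    (0.5, [(0.25, 100.0)]),
]
y_q = regular_ordinate(clusters_at_q)
```

The sparse cluster pulls the result up only slightly, which matches the stated intent that scattered clusters receive a smaller weight.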
From the ordinate values obtained for the abscissas, a relatively continuous distribution curve is obtained, and fitting it yields the corresponding regular distribution curve. The regular distribution curve of the left node (the v-th non-representative attribute node) is recorded as the first regular distribution curve, and that of the right node (the u-th non-representative attribute node) as the second regular distribution curve.
The difference between the first data distribution curve and the first regular distribution curve, and the difference between the second data distribution curve and the second regular distribution curve, are then calculated. For each abscissa of a data point on the first data distribution curve, the difference between the ordinate of the data point and the ordinate of the first regular distribution curve at the same abscissa is taken; its absolute value is the difference value of the left node at that abscissa and is recorded as a first difference value (there are several first difference values, one per abscissa). The difference value of the right node at the same abscissa is obtained in the same way and recorded as a second difference value (again one per abscissa). The ratio of the first difference value to the second difference value at the same abscissa is reduced by 1, and the absolute value of the result is the difference value at that abscissa; the mean of the difference values over all abscissas is recorded as the difference between the left node and the right node. Inverse proportion function normalization is applied to the difference, and the result is recorded as the edge weight of the two nodes.
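The edge-weight calculation just described can be sketched as below. Two details are assumptions: the inverse proportion normalization is taken to be 1 / (1 + difference), and a small `eps` is added to the denominator of the ratio to avoid division by zero. The function name and sample values are hypothetical.

```python
def edge_weight(left_points, left_curve, right_points, right_curve, eps=1e-9):
    """Edge weight between a left and a right node.

    All four sequences hold ordinates sampled at the same abscissas:
    observed data points and the fitted regular distribution curve.
    """
    diffs = []
    for lp, lc, rp, rc in zip(left_points, left_curve, right_points, right_curve):
        first = abs(lp - lc)    # first difference value (left node)
        second = abs(rp - rc)   # second difference value (right node)
        diffs.append(abs(first / (second + eps) - 1.0))
    difference = sum(diffs) / len(diffs)
    # ASSUMED inverse-proportion normalization into (0, 1].
    return 1.0 / (1.0 + difference)


# Both nodes deviate from their curves by the same amount: weight near 1.
similar = edge_weight([1.0, 2.0], [0.5, 1.5], [3.0, 4.0], [2.5, 3.5])
# The left node deviates twice as much as the right node: weight near 0.5.
dissimilar = edge_weight([1.0, 2.0], [0.5, 1.5], [3.0, 4.0], [2.75, 3.75])
```

Nodes whose deviations from their regular curves track each other get a weight close to 1, so K-M matching prefers to pair them.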
Edge weights between the other nodes are obtained in the same way. K-M matching is performed on the bipartite graph, and the two nodes with the maximum edge weight in the matching result are marked as the nodes to be merged. The nodes to be merged are then combined (in the manner of a two-tuple). The condition for deciding whether to continue iterating is as follows: in the second iteration, DBSCAN clustering is performed on the data according to the attribute combination of that iteration, and the result is recorded as the second clustering result; the first clustering result is the clustering result of the previous iteration. The difference between the first and second clustering results is measured by their NMI (Normalized Mutual Information), that is, the mutual information of the two clustering results after normalization. A threshold of 0.65 is set; if the NMI is larger than the threshold, iteration stops and the final attribute combination is obtained.
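The matching step and the stopping rule above can be sketched as follows. Several points are assumptions made for illustration: an exhaustive search over permutations stands in for K-M (Kuhn-Munkres) matching, which is only practical for a handful of attributes; the weight matrix layout and function names are hypothetical; and NMI is normalized by the arithmetic mean of the two entropies, which the text does not specify.

```python
from collections import Counter
from itertools import permutations
from math import log


def best_pair_from_matching(weights):
    """Return the (left, right) pair with the largest weight inside a
    maximum-weight perfect matching. weights[i][j] is the edge weight between
    attribute i on the left and attribute j on the right; i == j is skipped
    because an attribute is not connected to itself."""
    n = len(weights)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        if any(i == j for i, j in enumerate(perm)):
            continue
        total = sum(weights[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best_perm = total, perm
    if best_perm is None:
        raise ValueError("no matching without self-pairs exists")
    left = max(range(n), key=lambda i: weights[i][best_perm[i]])
    return left, best_perm[left]


def nmi(labels_a, labels_b):
    """Normalized mutual information of two label sequences over the same data."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and equally long")
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    h_a, h_b = entropy(count_a), entropy(count_b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both clusterings put everything in one cluster
    mutual = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    return mutual / ((h_a + h_b) / 2.0)


def should_stop(previous_labels, current_labels, threshold=0.65):
    """Stop iterating once consecutive clusterings agree strongly enough."""
    return nmi(previous_labels, current_labels) > threshold
```

For larger attribute sets, a polynomial-time assignment solver would replace the exhaustive search; the stopping rule is unaffected by that choice.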
Step 3: perform shard clustering of the data according to the final attribute combination, so as to achieve accurate storage.
With the final attribute combination obtained from the steps above, the data values corresponding to the attribute combination are used as the condition for shard clustering of the data. The data belonging to each resulting shard cluster is stored together, and an index is built to facilitate queries.
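A minimal sketch of this storage step is given below, under a simplifying assumption: rows whose values agree exactly on the final attribute combination fall into the same shard, and a dictionary keyed by those values serves as the index. The row layout, attribute names, and function names are hypothetical; a density-based grouping such as DBSCAN could replace the exact-match key where values are continuous.

```python
from collections import defaultdict


def build_shards(rows, attribute_combination):
    """Group rows into shards keyed by their values on the attribute combination.

    rows is a list of dicts; the returned dict doubles as a hash index from
    key tuple -> list of rows stored in that shard.
    """
    shards = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in attribute_combination)
        shards[key].append(row)
    return dict(shards)


def lookup(shards, *key_values):
    """Fetch the rows of one shard; returns an empty list for an unknown key."""
    return shards.get(tuple(key_values), [])


if __name__ == "__main__":
    # Hypothetical records; attribute names are illustrative only.
    rows = [
        {"region": "north", "voltage": 110, "load": 3.2},
        {"region": "north", "voltage": 110, "load": 2.9},
        {"region": "south", "voltage": 220, "load": 5.1},
    ]
    shards = build_shards(rows, ("region", "voltage"))
    print(len(lookup(shards, "north", 110)))
```

Keying the index on the same attribute combination used for sharding means a query that fixes those attributes touches a single shard.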
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (9)

1. A database data interface management method based on flexible configuration, characterized by comprising:
Step 1) obtaining data with different attributes in a database;
Step 2) obtaining the representative degree and the reference degree of the different attributes, and thereby obtaining the representative attribute and the reference attribute; constructing an attribute bipartite graph in each iteration, and obtaining the final attribute combination;
Step 3) performing shard clustering of the data according to the final attribute combination, so as to achieve accurate storage.
2. The flexible configuration-based database data interface management method according to claim 1, wherein in step 2), in the process of obtaining the attribute combination, K-M matching is performed on the bipartite graph, the K-M matching result is taken as the attribute combination result, and the attribute combination result of each iteration is obtained iteratively, so that the final attribute combination is obtained.
3. The flexible configuration-based database data interface management method as claimed in claim 1, wherein step 2) obtains the representative degree $\alpha_j$ of the j-th attribute according to formula (1),
In formula (1), $N_j$ represents the number of data types of the j-th attribute, $\max(N)$ represents the maximum number of data types over all attributes, and $J'_s$ represents the number of all attributes other than the j-th attribute; $Cv_s(n_j)$ represents the coefficient of variation of the data values of the s-th attribute (an attribute other than the j-th) when the j-th attribute takes its $n_j$-th data value.
4. The method for managing the database data interface based on flexible configuration according to claim 3, wherein in the step 2), the representative degrees of all the attributes are subjected to linear normalization processing, and the attribute corresponding to the maximum reference degree is selected as the reference attribute.
5. The flexible configuration-based database data interface management method according to claim 3, wherein the reference degree in step 2) is determined by the overall distribution under the current attribute, specifically: obtaining the truncated data in which the representative attribute takes the same value, and likewise obtaining the truncated data under the other values of the representative attribute; taking the intersection of the truncated data under all representative attributes, the intersection being the first truncated segment data; and calculating, within each truncated data, the reference degree $\gamma_j$ of the j-th attribute according to formula (2),
In formula (2), $J'_s$ represents the number of all attributes other than the j-th attribute; $M$ represents the number of all truncated data; $r_m(s)$ represents the Pearson correlation coefficient between the data of the s-th attribute (an attribute other than the j-th) and the data of the j-th attribute within the m-th truncated data.
6. The flexible configuration-based database data interface management method of claim 5, wherein the process of obtaining the first truncated segment data is: arranging all the data in the database, using the position serial numbers of the arrangement for positioning, and cutting the data having the same representative attribute value into segments, thereby obtaining the truncated data under the current representative attribute.
7. The flexible configuration-based database data interface management method according to claim 5, wherein the process of obtaining the final attribute combination in step 2) is specifically:
Determining an attribute combination by constructing an attribute bipartite graph based on the obtained representative attribute and reference attribute, wherein the bipartite graph is constructed for data in which the representative attribute takes the same value, each left node of the bipartite graph is set to any one non-representative attribute, each right node is set to one of the other non-representative attributes, and no edge connects a left node and a right node that represent the same attribute, so that the attribute bipartite graph is obtained once the edge weights between the nodes have been obtained;
Matching is carried out in an iterative mode, and the iterative process is as follows: in the first iteration process, carrying out K-M matching of the bipartite graph, after obtaining attribute combinations of matching results, calculating the difference between data clustered according to the attribute combinations and data clustered before, and stopping iteration if the difference is smaller than a preset difference threshold; based on the obtained representative attribute, taking the reference attribute as an overall reference, and calculating a data distribution curve of the attribute represented by the left node and the right node when the representative attribute is the same;
Selecting the d-th data value of the representative attribute, wherein the left node is the v-th non-representative attribute node, the abscissa of the data distribution curve of the left node is the reference attribute and the ordinate is the v-th non-representative attribute, and this curve is recorded as the first data distribution curve; the right node is the u-th non-representative attribute node, the abscissa of the data distribution curve of the right node is the reference attribute and the ordinate is the u-th non-representative attribute, and this curve is recorded as the second data distribution curve; the difference between the data points on each data distribution curve and the regular distribution curve of the corresponding attribute is taken, and the similarity of these differences is calculated to serve as the edge weight.
8. The flexible configuration-based database data interface management method according to claim 7, wherein the process of obtaining the regular distribution curve is specifically as follows: setting the abscissa of the regular distribution curve as the reference attribute and the ordinate as the v-th non-representative attribute, and determining, for each existing abscissa, the data value of the corresponding ordinate, namely calculating the ordinate value $y_q$ corresponding to the q-th abscissa value according to formula (3):
In formula (3), $H$ represents the number of clusters after DBSCAN clustering under the q-th reference attribute value; $W_h$ represents the number of data types in the h-th cluster; $p_w$ represents the frequency of occurrence of the w-th data category; $g_w$ represents the data value of the w-th data category; a further symbol (not reproduced in this text) represents the distribution feature of the cluster containing the w-th data category, expressed by the density among the data in that cluster; the ordinate value is obtained as a weighted average, with the weight of each data type given by an expression not reproduced in this text;
According to the ordinate values obtained for the abscissas, a continuous distribution curve is obtained and fitted to give the corresponding regular distribution curve; the regular distribution curve of the v-th non-representative attribute node is recorded as the first regular distribution curve, and that of the u-th non-representative attribute node as the second regular distribution curve; for each abscissa, the difference between the ordinate of the data point on the first data distribution curve and the ordinate of the first regular distribution curve at the same abscissa is taken, and its absolute value, being the difference value of the left node at that abscissa, is recorded as a first difference value; the difference value of the right node at the same abscissa is obtained in the same way and recorded as a second difference value; 1 is subtracted from the ratio of the first difference value to the second difference value at the same abscissa, and the absolute value of the result is the difference value at that abscissa; the mean of the difference values over the plurality of abscissas is recorded as the difference between the left node and the right node; inverse-proportion function normalization is applied to the difference, and the result is recorded as the edge weight between the two nodes; these operations are repeated to obtain the edge weights between the other nodes;
And carrying out K-M matching on the bipartite graph, obtaining two nodes corresponding to the maximum edge weight value in a matching result, marking the two nodes as nodes to be combined, and combining the nodes to be combined to form a regular distribution curve.
9. The flexible configuration-based database data interface management method of claim 7, wherein the condition for determining whether iteration needs to be continued is: calculating the NMI (normalized mutual information) value of the first clustering result and the second clustering result, and stopping iteration if the NMI value is larger than a set threshold, so as to obtain the final attribute combination; the first clustering result is the DBSCAN clustering result of the previous iteration, and the second clustering result is the DBSCAN clustering result of the iteration that follows it.
Application number: CN202311496447.8A
Publication number: CN117931954A (en)
Title: Database data interface management method based on flexible configuration
Priority date: 2023-11-10
Filing date: 2023-11-10
Publication date: 2024-04-26
Status: Pending
Family ID: 90758055
Family applications: 1
Country: CN


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination