CN111444949A

CN111444949A - Rule optimization-based data-driven granularity modeling method

Info

Publication number: CN111444949A
Application number: CN202010209091.5A
Authority: CN
Inventors: 胡星辰; 李妍; 陈超; 程光权; 吴克宇; 杜航; 孙博良; 黄金才; 刘忠
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2020-03-23
Filing date: 2020-03-23
Publication date: 2020-07-24

Abstract

The invention belongs to the technical field of information processing and discloses a data-driven granularity modeling method based on rule optimization. First, data-driven zero-order and first-order TS fuzzy models are established based on a clustering method, and an optimization algorithm adjusts the positions of the cluster centers in place of the non-directional traditional data clustering method, thereby achieving rule optimization and learning the internal structure of the data. Next, information granularity with specific semantics is generated, and the cluster centers are expanded into more robust interval information particles so as to partition the whole output space effectively. Finally, an optimization algorithm adjusts the distribution of the information particles so that the model outputs information particles in interval form, yielding a new data-driven granularity modeling method. Experiments on artificial and real data sets demonstrate the effectiveness of the method.

Description

Rule optimization-based data-driven granularity modeling method

Technical Field

The invention belongs to the technical field of information processing, and particularly relates to a data-driven granularity modeling method based on rule optimization.

Background

In recent years, with the rapid development of informatization, the mobile internet, the Internet of Things and related technologies, there is an urgent need for intelligent technologies such as knowledge discovery and reasoning and big data analysis and mining, and artificial intelligence modeling methods are among the most important tools for complex data mining and analysis in different fields. Modeling techniques based on fuzzy rules are a classical family of artificial intelligence modeling methods and are widely applied in various fields. As two typical topologies, the Mamdani and TS models construct information particles (fuzzy sets) in the condition and conclusion parts of the fuzzy rule model and form a nonlinear mapping between the input and output spaces. In contrast to the Mamdani model, the TS fuzzy model forms a rule-based system that describes complex nonlinear relationships and outputs numerical results directly rather than fuzzy outputs. In constructing a TS fuzzy rule model, obtaining the relation between the fuzzy sets and the rules is the core modeling problem. Fuzzy rules are commonly designed in one of two ways: expert-knowledge-driven rule acquisition and data-driven rule acquisition. For a low-dimensional input space, the fuzzy sets can be specified from human experience (expert knowledge), but for high-dimensional data in a big data setting, data-driven methods are usually required to obtain the rules. One feasible and important solution is to determine the fuzzy sets by fuzzy clustering. Fuzzy clustering is therefore crucial for data-driven fuzzy rule-based modeling, and the fuzzy C-means (FCM) clustering method and its variants are often used in practical systems because of their clear semantics and soft clustering features.
Existing optimization methods for such fuzzy rule modeling concern the two main parameters of FCM: (1) the number of rules, i.e. the number of cluster centers (prototypes) c, and (2) the value of the fuzzification coefficient m. The accuracy of the model improves as the number of clusters increases; however, the number of rules should not be too high in view of computational performance and memory limitations during modeling. In forming the model output, the fuzzification coefficient determines the level of interaction between the rules. When the fuzzification coefficient is close to 1, the membership functions resemble characteristic functions, and the output for each region of the input space is determined by the rule present in that region. As the fuzzification coefficient increases, the rules interact more strongly and jointly determine the output. In general, adjusting these two basic parameters is not sufficient, and there is still considerable room for improving the optimization of the fuzzy model.

Among the many methods for improving the parameters of FCM, no method for post-processing the FCM clustering results has yet been proposed. Using fuzzy clustering in system modeling is reasonable, but it is important for the clustering process to reveal the structure of the input and output spaces simultaneously. The cluster centers generated by FCM inevitably average over the distribution of the data; their determination is non-directional and unrelated to the mapping between the data input and output spaces, which means that the output range modeled by the rules may shrink and the accuracy of the mathematical model may decrease. On the other hand, in data-driven modeling, data errors, noise and the like inevitably introduce errors and disturbances into the parameters that form the model rules (such as the cluster centers), which affects the accuracy and robustness of the final output of the whole model. Therefore, optimizing the cluster centers and configuring information particles around them can significantly improve the fuzzy rules, yielding a granularity fuzzy rule modeling method with higher accuracy, robustness and fault tolerance.

The difficulties in solving the above technical problems are as follows:

1. Design of the optimization method for the cluster centers and their associated fuzzy rules, which must meet three requirements: the optimized cluster centers are deterministic and associated with the structure of the data space; the process of updating cluster centers in the data space avoids, as far as possible, overfitting caused by excessive pursuit of precision; and the method adapts to non-convex problems and converges quickly.

2. Inference and computation for the granularity rules and the granularity model: a well-designed granularity configuration strategy for the optimized cluster centers is important for improving the performance of the granularity model, and the inference computation must take full account of the different spatial relations between information particles.

3. Rationality evaluation of the information particles: information particle outputs cannot be evaluated with numerically driven performance indexes (such as RMSE or MSE), so an evaluation index system suited to the information particle form, with corresponding evaluation functions, must be designed, together with an optimization method for information particle configuration that accounts for the coupling and correlation among the multiple performance indexes.

The significance of solving the above technical problems is as follows. Fuzzy theory is one of the most understandable and acceptable perspectives for discovering and exploring the world. Human perception and cognition expressed in natural language usually have indistinct (fuzzy) boundaries. By mimicking human cognitive and behavioral habits, fuzzy rule modeling makes an important contribution by providing a gradual and diverse basis with which data can be well described and expressed. The invention mainly addresses the problem that, because the clustering method used in fuzzy rule modeling is non-directional, the rule structure obtained as the clustering result cannot fully support modeling of the characteristics of the data; solving it improves the accuracy of the model, and configuring information particles around the cluster centers yields granularity fuzzy rule modeling with robustness and fault tolerance. The method is significant and advanced for describing complex nonlinear systems in the real world, particularly in a big data environment.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a data-driven granularity modeling method based on rule optimization.

The invention is realized as a data-driven granularity modeling method based on rule optimization, in which the rule optimization and information granularity distribution for the fuzzy rule model comprise the following steps:

the first step: acquiring fuzzy rules for constructing the mathematical model based on a data clustering method, and constructing zero-order and first-order TS fuzzy rule models that map the data input space to the data output space;

the second step: optimizing the distribution of the cluster centers in the data space with a swarm optimization algorithm, with the aim of improving the accuracy with which the model maps the data input space to the data output space, thereby optimizing the fuzzy rules;

the third step: designing a non-uniform distribution strategy and expanding the optimized cluster centers into hypercube information particles with higher robustness and fault tolerance, thereby forming a model that outputs information particles in interval form;

the fourth step: with respect to the rationality of the granularity model output, further optimizing the configuration of information particles around the cluster centers using the coverage and specificity evaluation indexes, thereby optimizing and modeling the granularity rules.

Further, the fuzzy-rule-based TS model identifies the premise parameters of the rules with the fuzzy C-means (FCM) clustering algorithm and extends them in a data-driven manner. In the rule-based TS fuzzy model, the local models in the rules are smoothly connected by the membership functions obtained by FCM, forming a complete global fuzzy model. According to the different parameters of the conclusion part of the rules, the invention establishes zero-order and first-order TS fuzzy models to map the data input space to the data output space.

Further, the data-structure-based model rule optimization method optimizes the cluster center parameters through FCM and particle swarm optimization (PSO) from a data-driven perspective. The optimization principles are as follows: the optimized cluster centers are deterministic and associated with the structure of the data space; the process of updating cluster centers in the data space avoids, as far as possible, overfitting caused by excessive pursuit of precision; and the method adapts to non-convex problems and converges quickly. PSO optimizes the distribution of the cluster centers through swarm cooperation and information sharing; RMSE is used to evaluate the performance of the different strategies before and after cluster center optimization; and the Wilcoxon rank-sum test analyzes the significance of the optimization result, returning a positive scalar p and a logical value h as the test result.

The particle velocity Vel_id and position Pos_id in PSO are updated as follows:

Vel_id＝αVel_id+ζ₁r₁(P_id-Pos_id)+ζ₂r₂(P_gd-Pos_id)

Pos_id＝Pos_id+Vel_id

Here α is the inertia factor (α > 0), which reflects the "habit" of particle motion; adjusting the size of α balances global and local search behavior. ζ₁ and ζ₂ are learning factors, also known as acceleration constants; r₁ and r₂ are uniform random numbers in the range [0, 1]; P_id is the d-th dimension of the individual extremum of the i-th particle, and P_gd is the d-th dimension of the global optimal solution. To improve optimization efficiency, the invention sets upper and lower limits on the search space of the moving particles in PSO.

Further, the information granularity distribution method based on the fuzzy rule model uses a non-uniform distribution strategy to expand the cluster centers v_i, both before and after optimization, into hypercubes V_i with greater robustness and fault tolerance. The interval lengths allocated to the different dimensions of each cluster center differ, subject to a constraint on the overall balance of granularity information:

[The constraint formula and the definitions of its variables are not reproduced in this text.]

Further, based on the granularity model built on the TS fuzzy rule model, two performance indexes related to information granularity are selected: coverage and specificity.

Coverage: this index judges the accuracy of the granularity model output. Coverage expresses the degree to which the information particle Y_k generated by the model "covers" the target value y_k:

The higher the coverage, the better the model accuracy; a coverage of 1 means that every value y_k is contained in its interval information particle.

Specificity: this index describes the level of detail of the information particle Y_k. Specificity is a decreasing function of the interval length of Y_k and increases as the interval shrinks; when the allocated information granularity is 0, Y_k degenerates into a point and the specificity reaches its maximum value. Specificity is defined as follows:

where Y_k is the output interval determined by the model. Besides the exponential function above, any continuous decreasing function of the interval length is feasible as long as the boundary condition is met: when the interval length is 0, spe has a value of 1. Both the coverage and specificity indexes are constrained by the allocated information granularity and are denoted cov() and spe(). In the optimization problem of optimal information granularity distribution, the evaluation index is defined as:

Φ()＝cov()·spe()；

In addition, the curve in cov-spe coordinates is plotted and the area under the curve (AUC) is calculated as follows:

where s is the number of set points.
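The two indexes and their product can be sketched in a few lines of Python. This is an illustration only: the coverage is taken as the fraction of targets that fall inside their output intervals, the specificity as the mean of exp(-length), and the AUC as a trapezoidal area with cov on the horizontal axis. These concrete forms, the `scale` parameter and all function names are assumptions chosen to match the verbal description, not the formulas of the original publication.

```python
import math

def coverage(targets, intervals):
    """Fraction of targets y_k lying inside their interval Y_k = (lo, hi)."""
    hits = sum(1 for y, (lo, hi) in zip(targets, intervals) if lo <= y <= hi)
    return hits / len(targets)

def specificity(intervals, scale=1.0):
    """Mean of exp(-length / scale); equals 1 when every interval is a point.

    `scale` is a hypothetical normalising constant (assumption).
    """
    return sum(math.exp(-(hi - lo) / scale) for lo, hi in intervals) / len(intervals)

def phi(targets, intervals, scale=1.0):
    """Product index cov * spe used as the optimisation objective."""
    return coverage(targets, intervals) * specificity(intervals, scale)

def auc(cov_values, spe_values):
    """Trapezoidal area under the (cov, spe) curve, points sorted by cov."""
    pts = sorted(zip(cov_values, spe_values))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

Widening the intervals raises the coverage and lowers the specificity, so the product rewards intervals that are just wide enough to contain the targets.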

Further, the granularity rule of the information granularity distribution method based on the fuzzy rule model is defined as follows:

where capital letters denote the components of the rule that are expressed in the form of information granularity; the interval-valued nonlinear membership function is based on the granularity rule and the cluster centers, and the operator symbols in the rule denote interval calculations. The input is a numeric vector x in n-dimensional space; the i-th cluster center v_i is obtained through the first-order fuzzy model, where i indexes the rules; meanwhile, the hypercube cluster center V_i is constructed with the non-uniform distribution strategy. The hypercube is projected onto each coordinate axis, and in each j-th dimension the farthest and closest distances from the vector to the projection interval of the hypercube are calculated.

Further, the farthest and closest distances from the vector to the projection intervals of the hypercube are calculated by the following formula:

The distance between the numeric vector and the hypercube cluster center and the interval membership function are calculated as follows:

where m > 1.

The output interval is calculated as follows:

The output information particles are evaluated with the coverage and specificity performance indexes, which serve as the objective function to feed back into and optimize the granularity fuzzy rule modeling, so as to obtain a model with better performance.
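The projection step can be illustrated with a short Python sketch. It computes, per dimension, the closest and farthest distance from a coordinate to a projected interval, and then combines the per-dimension values into Euclidean distances to the whole hypercube. The Euclidean aggregation and the function names are assumptions made for illustration; the formulas of the original publication are not reproduced in this text.

```python
import math

def interval_distances(x_j, lo, hi):
    """Closest and farthest distance from scalar x_j to the interval [lo, hi]."""
    if lo <= x_j <= hi:
        d_min = 0.0  # the point lies inside the projected interval
    else:
        d_min = min(abs(x_j - lo), abs(x_j - hi))
    d_max = max(abs(x_j - lo), abs(x_j - hi))
    return d_min, d_max

def hypercube_distances(x, lower, upper):
    """Closest and farthest Euclidean distance from vector x to a hypercube.

    `lower` and `upper` hold the per-dimension bounds of the hypercube V_i.
    """
    per_dim = [interval_distances(xj, lo, hi)
               for xj, lo, hi in zip(x, lower, upper)]
    d_min = math.sqrt(sum(d[0] ** 2 for d in per_dim))
    d_max = math.sqrt(sum(d[1] ** 2 for d in per_dim))
    return d_min, d_max
```

For a point inside the hypercube the closest distance is 0, while the farthest distance is always attained at one of the corners.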

The invention aims to provide application of the rule optimization-based data-driven granularity modeling method in data mining and knowledge discovery.

In summary, the advantages and positive effects of the invention are as follows. Granularity fuzzy modeling is an extension of classical fuzzy modeling, in which determining the fuzzy sets is the core problem. The invention introduces cluster center optimization into the granularity fuzzy model: zero-order and first-order fuzzy models are first established, and an optimization algorithm adjusts the positions of the cluster centers in place of the non-directional traditional data clustering method, so that the internal structure of the data is learned during parameter identification. Then, by generating information granularity with specific semantics, the cluster centers are expanded into a more robust interval form that partitions the whole output space. The invention adjusts the distribution of the information particles with an optimization algorithm and thereby improves the data-driven granulation model; it focuses on interval information granularity and analyzes its properties without loss of generality. Furthermore, granulation of the (numeric) cluster centers is introduced, making the model more robust to noise and errors. Experiments on both artificial and real data sets demonstrate the effectiveness of the method.

Drawings

Fig. 1 is a flowchart of an information granularity allocation method based on a fuzzy rule model according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a two-dimensional granular clustering center and a granular clustering center constraint function for each dimension of data according to an embodiment of the present invention;

in the figure: (a) a two-dimensional granular clustering center; (b) an upper bound constraint function; (c) a lower bound constraint function.

Fig. 3 shows the positions of the cluster centers before (triangles) and after (rectangles) optimization for c = 3, provided by the embodiment of the present invention; the blue points are the entire data distribution, the red triangles are the cluster centers before optimization, the yellow rectangles are the cluster centers after optimization, and the arrows indicate the change in position of the cluster centers.

In the figure: (a) zero-order fuzzy model; (b) first-order fuzzy model.

Fig. 4 shows the positions of the cluster centers before (triangles) and after (rectangles) optimization for c = 5, provided by the embodiment of the present invention; the blue points are the entire data distribution, the red triangles are the cluster centers before optimization, the yellow rectangles are the cluster centers after optimization, and the arrows indicate the change in position of the cluster centers.

In the figure: (a) zero-order fuzzy model; (b) first-order fuzzy model.

Fig. 5 shows the positions of the cluster centers before (triangles) and after (rectangles) optimization for c = 7, provided by the embodiment of the present invention; the blue points are the entire data distribution, the red triangles are the cluster centers before optimization, the yellow rectangles are the cluster centers after optimization, and the arrows indicate the direction of the change in position of the cluster centers.

In the figure: (a) zero-order fuzzy model; (b) first-order fuzzy model.

Fig. 6 shows the positions of the cluster centers before (triangles) and after (rectangles) optimization for c = 9, provided by the embodiment of the present invention; the blue points are the entire data distribution, the red triangles are the cluster centers before optimization, the yellow rectangles are the cluster centers after optimization, and the arrows indicate the change in position of the cluster centers.

In the figure: (a) zero-order fuzzy model; (b) first-order fuzzy model.

FIG. 7 is an RMSE box plot before and after cluster center optimization on the artificial data set, provided by an embodiment of the present invention;

in the figure: (a) zero-order fuzzy model (training set); (b) zero-order fuzzy model (test set); (c) first-order fuzzy model (training set); (d) first-order fuzzy model (test set).

FIG. 8 is a schematic diagram of the performance curves of non-uniform granularity distribution before and after cluster center optimization under different numbers of rules for the Synthetic data set, provided by an embodiment of the present invention;

in the figure: (a) c = 3; (b) c = 5; (c) c = 7; (d) c = 9.

FIG. 9 is a schematic diagram of the performance curves of non-uniform granularity distribution before and after cluster center optimization under different numbers of rules for the Housing data set, provided by an embodiment of the present invention;

in the figure: (a) c = 3; (b) c = 5; (c) c = 7; (d) c = 9.

FIG. 10 is a schematic diagram of the performance curves of non-uniform granularity distribution before and after cluster center optimization under different numbers of rules for the Concrete data set, provided by an embodiment of the present invention;

in the figure: (a) c = 3; (b) c = 5; (c) c = 7; (d) c = 9.

FIG. 11 is a schematic diagram of the performance curves of non-uniform granularity distribution before and after cluster center optimization under different numbers of rules for the Mortgage data set, provided by an embodiment of the present invention;

in the figure: (a) c = 3; (b) c = 5; (c) c = 7; (d) c = 9.

FIG. 12 is a schematic diagram of the performance curves of non-uniform granularity distribution before and after cluster center optimization under different numbers of rules for the QSAR data set, provided by an embodiment of the present invention;

in the figure: (a) c = 3; (b) c = 5; (c) c = 7; (d) c = 9.

FIG. 13 is a schematic diagram of the performance curves of non-uniform granularity distribution before and after cluster center optimization under different numbers of rules for the Stock data set, provided by an embodiment of the present invention;

in the figure: (a) c = 3; (b) c = 5; (c) c = 7; (d) c = 9.

FIG. 14 is a schematic diagram of the performance curves of non-uniform granularity distribution before and after cluster center optimization under different numbers of rules for the Wizmir data set, provided by an embodiment of the present invention;

in the figure: (a) c = 3; (b) c = 5; (c) c = 7; (d) c = 9.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Aiming at the problems in the prior art, the invention provides a data-driven granularity modeling method based on rule optimization, and the invention is described in detail below with reference to the accompanying drawings.

As shown in fig. 1, the information granularity allocation method based on the fuzzy rule model according to the embodiment of the present invention includes the following steps:

S101: acquiring fuzzy rules for constructing the mathematical model based on a data clustering method, and constructing zero-order and first-order TS fuzzy rule models that map the data input space to the data output space;

S102: optimizing the distribution of the cluster centers in the data space with a swarm optimization algorithm, with the aim of improving the accuracy with which the model maps the data input space to the data output space, thereby optimizing the fuzzy rules;

S103: designing a non-uniform distribution strategy and expanding the optimized cluster centers into hypercube information particles with higher robustness and fault tolerance, thereby forming a model that outputs information particles in interval form;

S104: with respect to the rationality of the granularity model output, further optimizing the configuration of information particles around the cluster centers using the coverage and specificity evaluation indexes, thereby optimizing and modeling the granularity rules.

The technical solution of the present invention is further described with reference to the following specific examples.

1. Construction of the basic model

The basis of a fuzzy model is its fuzzy rules, which in data-driven approaches are obtained by capturing structure in the data. The rule-based multi-input single-output TS (Takagi-Sugeno) fuzzy model uses fuzzy inference and output to partition the data fuzzily and thereby represent complex nonlinear relations, and it is a key part of the design of fuzzy rule model systems. The model is described by IF-THEN fuzzy rules, each of which represents a subspace, expressed as follows:

IF x_k is A_i(x_k), THEN y_i is f_i(x_k), i＝1,...,c

where k = 1, 2, ..., N, and N is the number of input data. In the fuzzy rules, x_k is an n-dimensional input variable, A_i is the multivariate membership function of the i-th rule, y_i is the i-th output, and f_i(x_k) is a local linear function.

In the fuzzy rules, the membership functions smoothly connect the local linear models to form a global fuzzy model describing a nonlinear function. Clustering methods partition data with different attributes into different groups in an unsupervised manner, such that the data within each group have similar characteristics. Given this property, data-driven modeling can be realized in the data space by clustering (FCM): the cluster centers (prototypes) [v_i, w_i] in the input-output space Rⁿ⁺¹ are obtained by FCM, and the number of clusters equals the number of rules. The membership function A_i(x_k), as the condition part of the rule, is described by the cluster centers.

The output of fuzzy inference is obtained from the membership degree of the data with respect to each rule. The membership function A_i(x_k) represents a fuzzy subspace and is calculated as follows:

where m is the fuzzification coefficient.
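As an illustration, the membership degrees can be computed with the standard FCM membership formula, A_i(x) = 1 / Σ_j (‖x − v_i‖ / ‖x − v_j‖)^(2/(m−1)). The sketch below assumes this standard form and the Euclidean distance; the function name and the handling of a point that coincides with a center are illustrative choices, not taken from the original publication.

```python
import math

def fcm_memberships(x, centers, m=2.0):
    """Standard FCM membership of point x in each cluster (assumes m > 1)."""
    dists = [math.dist(x, v) for v in centers]
    # A point that coincides with one or more centers is shared among them.
    zero = [i for i, d in enumerate(dists) if d == 0.0]
    if zero:
        return [1.0 / len(zero) if i in zero else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
            for d_i in dists]
```

The memberships of a point sum to 1, and as m approaches 1 they approach a crisp assignment to the nearest center, which matches the behavior described in the Background.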

The structure identification and parameter estimation of the condition and conclusion parts are very important in building the fuzzy model, and the zero-order and first-order models handle them differently. In the zero-order model, FCM yields a set of cluster centers v₁, v₂, …, v_c and membership functions A_i(x_k) that reveal the structure in the data, and the conclusion part is a constant obtained by minimizing the root mean square error (MSE) between the actual output and the predicted output. It is defined as follows:

f_i(x)＝a₀, a₁＝a₂＝…＝a_n＝0

The estimation process for the first-order TS model is the same as for the zero-order model except for the constants of the conclusion part. The first-order model assigns the projected cluster centers to each input variable in a linear fashion, which is essential for fitting the output data with the underlying structure of the input space: by balancing the structure of the input and output spaces, the mapping from the cluster centers in the data space to the cluster centers in the output space achieves higher accuracy. Considering that a higher-order local model in the conclusion part of a rule can achieve higher precision, the invention follows the Taylor expansion to convert the zero-order model into a higher-order model, giving a new strategy:

By integrating all the relevant rules and membership functions, the invention obtains the target output:

The parameters a in formula (4) and formula (5) are obtained by minimizing the error Q between the target output and the model output with the least squares method, and the root mean square error (MSE) is adopted as the evaluation index of the quality of the fuzzy model.

The invention finally establishes zero-order and first-order fuzzy models by applying a fuzzy clustering method in a data space.
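The way the rules are combined into one output can be sketched as a membership-weighted sum of the local models. The sketch below assumes the generic TS form f_i(x) = a_i0 + a_i1·x_1 + … + a_in·x_n and memberships that already sum to 1; the exact first-order form used by the invention (derived from the Taylor expansion) is not reproduced in this text, so the names and the local-model layout here are illustrative assumptions.

```python
def ts_output(memberships, x, consequents):
    """Membership-weighted sum of local models (memberships sum to 1).

    Each consequent is [a0, a1, ..., an]; a zero-order rule has a1..an = 0.
    """
    y = 0.0
    for A_i, a in zip(memberships, consequents):
        # Local model of rule i: constant term plus linear terms.
        local = a[0] + sum(a_j * x_j for a_j, x_j in zip(a[1:], x))
        y += A_i * local
    return y
```

With memberships [0.9, 0.1] and zero-order consequents 1.0 and 3.0, the model output is 0.9·1.0 + 0.1·3.0 = 1.2, a smooth blend of the two local constants.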

2. Optimization of cluster centers

In a general TS fuzzy model, FCM obtains, in an unsupervised manner, the membership functions and cluster centers that form the condition part of the model. From the way the membership functions A_i(x_k) and the cluster centers are computed in a fuzzy subspace, it is evident that a cluster center reflects the average feature of its subspace. This averaging nature of FCM prevents the cluster centers from fully expressing the structure of the data in the subspace, which may reduce the accuracy of the model. At this stage, the core task of the invention is to re-partition the data space from a data-driven perspective so that the model rules are formed from cluster centers that capture the essential features of the data structure.

Therefore, the optimization of the cluster centers should satisfy the following three basic requirements: the optimized cluster centers are deterministic and associated with the structure of the data space; the process of updating cluster centers in the data space avoids, as far as possible, overfitting caused by excessive pursuit of precision; and the method adapts to non-convex problems and converges quickly.

To meet these requirements when optimizing the cluster centers, the invention pays particular attention to the choice of optimization method. Evolutionary algorithms such as the genetic algorithm (GA), particle swarm optimization (PSO) and differential evolution (DE) are now mature and widely used; PSO, which is easy to implement, accurate and fast to converge, is finally selected. PSO is a swarm heuristic designed by simulating the foraging behavior of bird flocks. Each particle in the swarm represents a possible solution to the problem, and the optimal solution is found through cooperation and information sharing among the individuals in the swarm. In addition, the invention selects an effective evaluation index as the fitness function of the optimization algorithm: the root mean square error (RMSE) between the predicted output and the actual output. RMSE is highly sensitive to very large or very small errors in a set of predictions and therefore reflects the accuracy of the prediction results well. The Wilcoxon rank-sum test, which requires no information about the data distribution, is used to analyze the effectiveness of the optimization method. In the significance analysis, the Wilcoxon rank-sum test takes as its null hypothesis that the data in the two independent samples come from continuous distributions with the same median. The test returns a positive scalar p and a logical value h, with which the hypothesis is judged.
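A minimal sketch of the rank-sum test is given below for illustration. It uses the normal approximation without tie or continuity corrections and returns the pair (p, h), with h true when the null hypothesis is rejected at level alpha. The function name, the default alpha = 0.05 and the use of the approximation rather than an exact test are assumptions; the text does not state which implementation was used.

```python
import math

def rank_sum_test(sample_a, sample_b, alpha=0.05):
    """Two-sided Wilcoxon rank-sum test via the normal approximation."""
    pooled = sorted((v, g) for g, s in enumerate((sample_a, sample_b)) for v in s)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average 1-based rank of a tied group
        for k in range(i, j + 1):
            ranks[k] = avg
        i = j + 1
    n1, n2 = len(sample_a), len(sample_b)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of sample_a
    mean = n1 * (n1 + n2 + 1) / 2.0
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided p-value
    return p, p < alpha
```

In the setting described above, `sample_a` and `sample_b` would hold the RMSE values from repeated runs before and after cluster center optimization; a small p with h true indicates that the change in RMSE is significant.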

The cluster centers are assembled into the particle position vector, and the optimal cluster centers are sought in the search space. In the PSO algorithm, assuming that the problem has only one best solution in the region (the cluster center positions that minimize RMSE), the particles share their positions with one another during the search, cooperating to determine whether the swarm has found the best solution and providing information about it. Finally, by adjusting the position and velocity of each particle, the swarm gathers around the position of the solution, which means that the optimal solution to the problem has been found. The particle velocity Vel_id and position Pos_id are updated as follows:

Vel_id＝αVel_id+ζ₁r₁(P_id-Pos_id)+ζ₂r₂(P_gd-Pos_id)

Pos_id＝Pos_id+Vel_id

α is called the inertia factor (α > 0) and reflects the "habit" of particle motion; global and local search behavior is balanced by adjusting the size of α. ζ₁ and ζ₂ are learning factors, also known as acceleration constants; r₁ and r₂ are uniform random numbers in the range [0, 1]; P_id is the d-th dimension of the individual extremum of the i-th particle, and P_gd is the d-th dimension of the global optimal solution. To limit the computational complexity, the invention sets an upper bound and a lower bound for the moving particles.
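The update rule can be sketched for a single particle as follows. This is only an illustration under stated assumptions: the function name `pso_step`, the default inertia factor `alpha=0.7`, and clamping positions to the bounds are choices made here for the example; the source specifies only the update equations and the existence of upper and lower bounds.

```python
import random

def pso_step(pos, vel, pbest, gbest, lower, upper,
             alpha=0.7, zeta1=2.0, zeta2=2.0, rng=random):
    """One velocity and position update for a single particle.

    pos, vel : current position and velocity (lists, one entry per dimension d)
    pbest    : best position found so far by this particle (P_id)
    gbest    : best position found so far by the swarm (P_gd)
    lower, upper : per-dimension bounds on the position (clamping is an assumption)
    """
    new_pos, new_vel = [], []
    for d in range(len(pos)):
        r1, r2 = rng.random(), rng.random()  # uniform random numbers
        v = (alpha * vel[d]
             + zeta1 * r1 * (pbest[d] - pos[d])
             + zeta2 * r2 * (gbest[d] - pos[d]))
        p = pos[d] + v
        p = min(max(p, lower[d]), upper[d])  # keep the particle inside the bounds
        new_vel.append(v)
        new_pos.append(p)
    return new_pos, new_vel
```

In the setting of this document, `pos` would hold the flattened coordinates of all cluster centers, and the fitness of a position would be the RMSE of the fuzzy model built from those centers.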

3. Granularity design of fuzzy rule model

As a novel information processing paradigm, the granular model lifts the numerical model to a more abstract level by admitting a certain level of information granularity, which gives the model higher tolerance. Different information granularity allocation methods also provide solutions for various types of data. The invention extends the TS fuzzy model through the allocation of information granularity. FCM and the optimization algorithm are used to update the cluster centers, and the model returns the range of possible output values in the form of information granules. The final goal of this stage is to generate a granular model by allocating granularity to the cluster centers before and after optimization. To evaluate the granular model, the invention selects two performance indices related to information granularity, specificity and coverage; the specific information granularity allocation strategy is described in this section.

3.1 Granularity evaluation indices

To quantify the performance of the granular model more reasonably, the invention adopts a granularity evaluation method. Two complementary evaluation indices are commonly used to judge the performance of an information granule Y_k with respect to the numeric datum y_k.

(1) Coverage: this index is used to judge the quality and precision of the output of the granular model. In other words, coverage expresses the extent to which the information granule Y_k generated by the model "covers" y_k.

The higher the coverage, the better the model performance. In the ideal case the coverage is 1, which means that all values y_k are contained in the intervals of the corresponding information granules.

(2) Specificity: this index describes the level of detail of the information granule Y_k. Briefly, specificity is a decreasing function of the interval length of Y_k, so it increases as the interval length decreases. When the allocated information granularity is 0, Y_k degenerates into a point, at which the specificity reaches its maximum value. Specificity can be defined as:

wherein

is the output interval determined by the decision. Besides the exponential function above, any continuous decreasing function of the interval length is feasible as long as the boundary condition is met: when the interval length is 0, the specificity is 1. Both indices are constrained by the allocated level of information granularity and are therefore written as cov() and spe().

Specificity and coverage are conflicting in nature. As the allocated information granularity increases, specificity gradually decreases and coverage gradually increases. To determine the best level of information granularity, the evaluation index of the optimization problem is defined as:

Φ()＝cov()·spe()；

Φ() can be used as the objective function for optimizing the granularity allocation, so as to achieve an optimal allocation of information granularity. To evaluate the overall performance of the model, the invention plots a curve in spe-cov coordinates and calculates the area under the curve (AUC), expressed as follows:

where s is the number of assumed values.
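The indices of this subsection can be illustrated with the following sketch. It rests on several assumptions that go beyond the text: coverage is taken as the fraction of targets inside their intervals, specificity as the mean of exp(-length) over unnormalized interval lengths, and the area under the cov-spe curve is approximated with the trapezoidal rule. All function names are hypothetical.

```python
import math

def coverage(targets, intervals):
    """Fraction of targets y_k that fall inside their interval Y_k = (lo, hi)."""
    hits = sum(1 for y, (lo, hi) in zip(targets, intervals) if lo <= y <= hi)
    return hits / len(targets)

def specificity(intervals):
    """Mean of exp(-length); equals 1 when every interval is a single point."""
    return sum(math.exp(-(hi - lo)) for lo, hi in intervals) / len(intervals)

def phi(targets, intervals):
    """Product of coverage and specificity, used as the objective."""
    return coverage(targets, intervals) * specificity(intervals)

def auc_cov_spe(points):
    """Trapezoidal area under a curve given as (cov, spe) pairs,
    one pair per examined level of information granularity."""
    pts = sorted(points)
    area = 0.0
    for (c0, s0), (c1, s1) in zip(pts, pts[1:]):
        area += (c1 - c0) * (s0 + s1) / 2.0
    return area
```

Widening every interval raises `coverage` and lowers `specificity`, so `phi` rewards intervals that are just wide enough to contain the targets.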

The granular fuzzy model is mainly realized by allocating interval information granularity to the input variables, the model parameters or the output variables. To improve the ability to characterize the internal features of the data, the invention allocates a certain level of information granularity to the main parameters of the model (such as the cluster centers [v_i, w_i]) and realizes granularity integration in interval form.

3.2 Information granularity allocation strategy

Building on the TS rule-based fuzzy model, the invention selects the first-order fuzzy model to balance the input and output spaces in the identification process. The location of the cluster centers has a significant impact on the accuracy of the model. In practice it is difficult to obtain the ideal positions of the cluster centers, and the granular model provides a feasible solution. The invention models the cluster centers as intervals to achieve granulation. The granularity rules are defined as follows:

It represents the interval-valued nonlinear membership function of the cluster center under the granularity rule, and the two operator symbols in the rule denote interval calculations. The invention provides information granules for the numerical model and uses the function Φ(), which is related to the accuracy of the result, as the metric. The space of model parameters is thereby converted from numerical form into the form of information granules.

Assume that the input is a data set x in an n-dimensional space and that the i-th cluster center v_i is obtained by the first-order fuzzy model (i is the index of the rule); then any entry in dimension n-1 can be replaced with an interval. The invention constructs a hypercube cluster center V_i, whose sides may be of unequal length and asymmetrically distributed around the cluster center. Since the granulation of the i-th rule for a given data set x occurs in interval form, the distance between a numerical vector and the hypercube cluster center V_i must be quantified. For a high-dimensional hypercube, computing the distance from a vector to V_i directly is not feasible. An effective solution is to project the hypercube onto the corresponding coordinate axis e_i and, in the same j-th dimension, to compute the farthest and closest distances from the projection of the vector to the projected interval of the hypercube (the hatched region). The interval boundaries are illustrated in Fig. 2, and the calculation formulas are as follows:

where m > 1 is the same fuzzification coefficient m used in constructing the fuzzy rule model. Let:

The complete calculation of the output interval is given by:
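The projection step described in this subsection, taking the closest and farthest distance between a numeric vector and a hypercube one coordinate at a time, might be sketched as follows. Combining the per-dimension values with a Euclidean norm is an assumption made here for illustration, and the function name `interval_distances` is hypothetical.

```python
def interval_distances(x, lower, upper):
    """Closest and farthest distance from vector x to the hypercube
    whose j-th side is the interval [lower[j], upper[j]].

    Per dimension, the closest distance is 0 when x[j] lies inside the interval,
    otherwise the gap to the nearer endpoint; the farthest distance is the gap
    to the farther endpoint. The Euclidean aggregation is an assumption.
    """
    near_sq = far_sq = 0.0
    for xj, lo, hi in zip(x, lower, upper):
        to_lo, to_hi = abs(xj - lo), abs(xj - hi)
        near = 0.0 if lo <= xj <= hi else min(to_lo, to_hi)
        far = max(to_lo, to_hi)
        near_sq += near ** 2
        far_sq += far ** 2
    return near_sq ** 0.5, far_sq ** 0.5
```

The two values bracket every distance between `x` and a point of the hypercube, which is what allows membership degrees, and in turn the model output, to be expressed as intervals.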

3.3 Optimal allocation of information granularity

Information granules can be assigned to each cluster center in a variety of ways, and different methods give the granular model different flexibility. The invention selects a non-uniform allocation strategy as the basis for allocating granularity to the cluster centers: the intervals around the parameter values may have different lengths, but the constraint of overall granularity balance must be satisfied:

where the two variables in the constraint are given by:

In the non-uniform, asymmetric allocation strategy, the uncertain values of all information granules need to be determined. A heuristic algorithm is feasible at this scale, and the invention again selects PSO to accelerate the solution process. The optimal allocation aims to allocate information granularity on the basis of the fuzzy rules of the respective model. To avoid errors that may be introduced by the uniform distribution of the FCM cluster centers, the invention allocates information granularity to the cluster centers before and after optimization, respectively. This makes it possible to compare visually how the model changes before and after the adjustment of the cluster centers and to evaluate the influence of the mined internal data structure on the granular fuzzy model.

The technical effects of the present invention will be described in detail with reference to experiments.

The invention carries out experiments on one artificial data set and six real data sets. Based on the FCM algorithm, the effectiveness of extracting the structural features of the TS fuzzy model input space is verified. Information granularity is then allocated to the numerical model using the non-uniform, asymmetric strategy to build a granular model.

1. Experiment with artificial data set

A two-dimensional data set is generated by the following distribution:

where x₁ and x₂ are input variables in the interval [1, 5]. The invention sets the number of data samples to 800 and uses ten-fold cross-validation to partition the data, validate the built model and evaluate its performance.

The invention describes the rule-based modeling process of the two TS fuzzy model strategies, compares them in detail using RMSE as the measure, and explains their advantages. In this process, the invention obtains c cluster centers of the data subspace through the standard FCM algorithm (c is set to 3, 5, 7 and 9 for the analysis), which is equivalent to determining the number of rules of the granular model. The value of the fuzzification coefficient m has a great influence on the performance of the FCM algorithm. When m = 1, FCM degenerates into the hard c-means (HCM) clustering algorithm. As m approaches infinity, the center of each cluster nearly coincides with the center of gravity of all the data. In practice, the choice of m depends on the data itself. Experiments verify that better results are obtained when m = 2, so the invention sets the fuzzification coefficient m to 2. By using PSO, the invention obtains the cluster centers of the model that minimize the RMSE.
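To make the role of the cluster centers and of m concrete, the following sketch computes standard FCM membership degrees for one sample and uses them to form a zeroth-order TS output as a membership-weighted sum of constant conclusions. The weighted-sum form of the output and the names `fcm_memberships` and `ts_zero_order` are assumptions for illustration; the exact model equations of the invention are not reproduced in this section.

```python
def fcm_memberships(x, centers, m=2.0):
    """Standard FCM membership of sample x in each cluster (requires m > 1)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, v)) ** 0.5 for v in centers]
    if any(d == 0.0 for d in dists):
        # The sample coincides with one or more centers: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dj) ** power for dj in dists) for di in dists]

def ts_zero_order(x, centers, constants, m=2.0):
    """Zeroth-order TS output: memberships weight the constant conclusions."""
    u = fcm_memberships(x, centers, m)
    return sum(ui * wi for ui, wi in zip(u, constants))

# Hypothetical one-dimensional example with two rules.
print(ts_zero_order([0.0], [[-1.0], [1.0]], [2.0, 4.0]))
```

Because the memberships depend only on distances to the centers, moving a center changes the model output everywhere, which is why optimizing the center positions against the RMSE can change accuracy so strongly.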

The parameter settings for the PSO algorithm are as follows: the number of particles is 200 and the function tolerance is 1e-6. Taking time consumption into account, the number of iterations is set to 600, at which point the model already obtains good results. The acceleration coefficients ζ₁ and ζ₂ are both 2, and the random coefficients r₁ and r₂ are generated uniformly at random in [0, 1]. The invention sets upper and lower bounds on the data space to constrain the results: the upper bound is 1.2 times the maximum value of each dimension and the lower bound is 0.8 times the minimum value of each dimension. Meanwhile, the Wilcoxon rank sum test is used to analyze the results of the two strategies before and after cluster center optimization, so as to verify the effectiveness of the optimization algorithm. When h = 1, the null hypothesis is rejected at the given significance level; in other words, the results before and after optimization differ with statistical significance. Figs. 3-6 depict the migration of the cluster centers for the two strategies for different values of c. The solid rectangles are the initial cluster centers, the solid triangles are the cluster centers after optimization, and the remaining solid points show the distribution of the data. To further analyze the influence of the number of cluster centers on the two strategies, the invention varies the value of c and compares the results before and after optimization using ten-fold cross-validation. Fig. 7 visualizes the results of these experiments.
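The bound setting described above (1.2 times the per-dimension maximum and 0.8 times the per-dimension minimum) can be sketched as follows. The function name `search_bounds` is hypothetical, and the sketch assumes non-negative data, as in the artificial data set on [1, 5]; for negative values these factors would shrink the range rather than widen it.

```python
def search_bounds(data, upper_factor=1.2, lower_factor=0.8):
    """Per-dimension PSO search bounds derived from the data.

    data is a list of samples, each a list of non-negative numbers.
    Returns (lower, upper), one value per dimension.
    """
    columns = list(zip(*data))  # transpose: one tuple per dimension
    lower = [lower_factor * min(col) for col in columns]
    upper = [upper_factor * max(col) for col in columns]
    return lower, upper

# Hypothetical samples in the interval [1, 5].
lower, upper = search_bounds([[1.0, 5.0], [2.0, 4.0]])
print(lower, upper)
```

Bounds of this kind keep the optimized cluster centers close to the region actually occupied by the data while still letting them move slightly beyond its extremes.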

As shown in Figs. 3-6, the unoptimized cluster centers in the model are evenly distributed over the data. After PSO optimization, the cluster centers spread from this uniform distribution toward the edges, and some cluster centers move close to each other, showing a tendency to re-cluster. In addition, as shown in Fig. 7, for the first-order model on the test set with c = 3, the result after cluster center optimization improves by 69.54% relative to that before optimization. When c is 5, 7 and 9, the corresponding improvement rates are 77.46%, 84.77% and 86.39%, respectively. The zeroth-order model shows the same trend, and the training set results are similar to those of the test set. The significance test shows that the results before and after cluster center optimization differ with statistical significance. The models built on the optimized cluster centers achieve better results by identifying latent data structures in the data space. These results show that learning the hidden structure in the data through the cluster centers improves the performance of the model.

Further, as c increases in Fig. 7, the RMSE gradually decreases. On the test set of the zeroth-order model, when c changes from 3 to 5, 7 and 9, the RMSE improvement rates with the cluster centers before optimization are 19.64%, 21.80% and 32.12%, respectively, while those of the model after cluster center optimization are 29.14%, 47.78% and 55.54%, respectively. The first-order model shows a similar trend. Increasing the number of cluster centers (the number of model rules) therefore helps to improve the accuracy of the model. However, c should not be increased blindly: when the number of cluster centers is too large, the first-order model may overfit.

The invention allocates information granularity to the parameters (cluster centers) of the model to construct a more tolerant granular model, and invokes PSO again to find the best allocation of granularity. The PSO uses 100 particles and ζ₁ = ζ₂ = 2.01; parameters not mentioned are the same as before. The choice of the information granularity level is directly related to the model performance, and its influence must be limited when exploring the effect of the cluster center positions and their parameters on the model. Clearly, when c is large, the high precision that the optimized cluster centers bring to the numerical model is not suited to the granular fuzzy model: too large a granularity level sharply reduces specificity without allowing coverage to keep growing, which limits the effectiveness of the model. As shown in Table 1, the invention sets the information granularity level to 0.1 except when c = 9; in that case the level is set to 0.05 and the number of particles in the optimization is set to 200. These parameter values were determined experimentally so that the calculation is as fast as possible while the requirements of the model are still met.

TABLE 1 information granularity settings at different c-values

Fig. 8 shows the results before and after non-uniform and asymmetric distribution of granularity using different numbers of rules (3, 5,7, and 9). In order to simplify the expression, the granulation model before cluster center optimization and the granulation model after cluster center optimization are respectively marked by GMIP and GMOP in the legend. All parameter combinations are subjected to a ten-fold cross-validation method, and are evaluated by coverage rate and specificity indexes.

Table 2 AUC values of the granular model based on the optimal allocation of information granularity, for different numbers of initial and optimized cluster centers: synthetic data

Fig. 8 shows the relationship between coverage and specificity of the granular model for different prototypes. The solid lines represent the training set results and the dashed lines the test set results; this convention is used in the following experiments unless otherwise noted. As coverage increases, specificity decreases. When c is relatively small, the coverage of the granular model with optimized cluster centers is better than that of the model using the original cluster centers, without sacrificing specificity. However, when c is large, coverage remains high but specificity becomes poor. The sensitivity of the fuzzy model to c increases with the optimization of the cluster centers. Even when affected by too many rules, the granular model with optimized cluster centers captures some spatial structural features and covers most of the target outputs. In addition, outliers appear in the cov-spe curves because of the randomness of PSO in the allocation of information granularity. Moreover, because the product of coverage and specificity is used as the fitness function, the conflicting nature of the two indices leads the model to sacrifice specificity during optimization.

Table 2 shows the effectiveness of the granular model for the different cluster centers. The proposed method gives better results than the prototypes obtained by FCM. Regardless of how c changes, the granular model after cluster center optimization obtains a higher AUC value by learning the hidden structure in the data space.

2. Experiment with real data sets

The invention was tested on six real data sets (for details, see Table 3). As in the artificial data set experiment, the invention uses the two methods to construct the TS fuzzy model and optimizes the cluster centers by PSO (with the same parameter settings). A granular model is then built on the basis of the rule-based first-order fuzzy model.

TABLE 3 introduction of the actual data set in the experiment

For equation (1), the fuzzy sets in the condition part are determined by standard FCM with the fuzzification coefficient, and the constants in the conclusion part are obtained by minimizing the sum of squared errors. Once the numerical model is built, information granularity is allocated to the cluster centers of the model so as to maximize the evaluation index Φ(). The values of the information granularity level are given in Table 1, and the values of the other parameters are the same as for the artificial data set.

TABLE 4 RMSE before and after optimization of clustering centers in zeroth order model for two strategies under different clustering numbers

The RMSE of the zeroth-order and first-order fuzzy models before and after cluster center optimization is shown in Tables 4 and 5 (bold entries indicate a significant difference in the results). The training set results are similar to those of the test set, so the analysis focuses on the test set. In general, the performance of both models improves as the number of cluster centers increases. However, for Housing and QSAR, the results of the zeroth-order model with optimized cluster centers show no clear trend with the number of rules. According to the previous experiments, the zeroth-order model produced no outliers, which indicates that it is more robust to overfitting. The first-order model clearly performs better than the zeroth-order model. However, when Housing was cross-validated, its RMSE value was as high as 58. Comparing the RMSE of the first-order model for different numbers of rules shows that the higher c is, the greater the likelihood of RMSE anomalies. The first-order model is more likely to overfit than the zeroth-order model, which is the price of its higher fitting accuracy.

TABLE 5 RMSE before and after optimization of clustering centers in first-order model for two strategies under different clustering numbers

Judging by the RMSE, the model after cluster center optimization follows a trend similar to that of the model before optimization. On the test set, the RMSE improvement rate of the zeroth-order model after cluster center optimization is between 20% and 78%, and the corresponding first-order model improves by at least 24% and remains better than the zeroth-order model. Analyzing the optimization results for Housing, the invention finds that the earlier overfitting problem of the first-order fuzzy model disappears after the cluster centers are optimized, which shows that optimizing the cluster centers has a positive influence on the model. The Wilcoxon rank sum test is applied to the results of the model after cluster center optimization to judge whether they are statistically significant. Overall, the results differ significantly on almost all data sets, which shows that optimization with the PSO algorithm is effective.

In addition, in order to obtain more details after optimizing the clustering center, the invention performs granularity allocation on the established TS model, and the specific situations are shown in fig. 9-fig. 14.

Table 6 AUC values of the granular model based on the optimal allocation of information granularity, for different numbers of initial and optimized cluster centers: real data sets

The plots in Figs. 9-14 show that, even though the tested data sets differ in structure, common conclusions can be drawn from the cov-spe plots.

(1) The overall performance of the optimized model is generally better than the original model. Furthermore, whatever type of model is analyzed, the metrics on the training set are superior to the metrics on the test set under the same rules.

(2) The performance of the granular model is related to the number of rules. As the number of rules increases, coverage gradually increases while the corresponding specificity decreases. The coverage of the granular model based on the initial cluster centers grows relatively slowly compared with the model based on the optimized cluster centers. For the models with optimized cluster centers, specificity is usually low, but c = 9 is a special case in which the models achieve high coverage at the expense of specificity. The AUC curves of the granular model in the figures show anomalies, which are caused by using the product Φ() of the two evaluation indices as the fitness function, as already noted in the artificial data set experiment.

The AUC values of the granular model based on the two kinds of cluster centers are shown in Table 6. On all test sets except the Wizmir data set, the AUC of the granular model with optimized cluster centers is better than that of the granular model with the initial cluster centers for the same c. The AUC results for Wizmir are poor when c = 3 and c = 5. For the Housing, QSAR and Stock data sets, the results of the model after cluster center optimization keep improving as the number of rules increases from 3 to 7. For the Mortgage data set, the AUC of the model after cluster center optimization gradually decreases instead. When c = 9, the AUC of the model with optimized cluster centers is still better than that with the initial cluster centers, except for the Housing data set. The AUC remains excellent on both the QSAR and Stock data sets even where its value at this c is lower. At this point, the AUC of the Concrete, Stock and Mortgage data sets is similar to the case c = 7. These results indicate that optimizing the cluster centers of the numerical model is an effective approach. The cluster center locations are updated by learning hidden data structures, which is crucial to building a granular model. However, because of the high sensitivity of the model, an excessive increase in the number of rules may result in overfitting. In addition, too high a data dimension increases the difficulty of structure learning and thereby affects the performance of the granular model.

To explore the influence of the data-driven method on the fuzzy rule model intuitively and convincingly, the invention provides a granular model that pays attention to the structure of the data space. The model captures hidden structural information in the data space by optimizing the cluster centers of the existing numerical model, and obtains better results by combining this with an information granularity allocation strategy. Experiments on artificial and real data sets with different fuzzy strategies and granularity allocation strategies show that the model has higher fitting capacity. The invention learns the structure of the data space through the cluster centers obtained by clustering. Because the positions of the cluster centers are difficult to predict and prone to overfitting, future research will consider a learning method that combines multiple data spaces. One direction is to build the granular model directly on the output data, avoiding deviations introduced by converting the numerical model. Another is to adopt a new method for identifying the precondition parameters, such as a neural network, to obtain more accurate structural features of the input space. Combining the two strategies to obtain the structural information of both the input and output spaces may achieve better results.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A data-driven granularity modeling method based on rule optimization is characterized in that the fuzzy rule optimization and information granularity distribution method comprises the following steps:

the second step: optimizing the distribution of the cluster center prototypes in the data space by a swarm optimization algorithm to realize the optimization of the fuzzy rules, with the aim of improving the accuracy with which the model maps the data input space to the data output space;

the third step: designing a non-uniform allocation strategy, expanding the optimized cluster centers into hypercube information granules with robustness and fault tolerance, so that the model outputs information granules in interval form;

2. The fuzzy rule modeling based on clustering algorithm according to claim 1, wherein the fuzzy model uses fuzzy C means FCM algorithm to realize the identification of the precondition parameters in the rule; taking a TS model structure as an example, the model establishes local linear models in each fuzzy subspace, and the local models are smoothly connected through a membership function obtained by FCM to form a global fuzzy model describing a nonlinear function; based on different parameters of a theory part in the rule, a modeling method is designed by taking a zero-order fuzzy model and a first-order fuzzy model as examples.

3. The data structure-based fuzzy rule optimization method of claim 1, wherein said fuzzy rule optimizes the FCM's result cluster center from a data-driven perspective, the cluster center optimization method satisfying: the optimized cluster centers are deterministic and associated with the data space structure.

4. The fuzzy rule optimization method based on the data structure as claimed in claim 3, wherein the method improves the model on the premise of satisfying the optimization rule; PSO has the characteristics of easy realization, high precision and fast convergence, and the PSO technique is selected to optimize the cluster centers and their associated fuzzy rules; the cluster centers are assembled into particle positions, and the optimal cluster centers are searched for in the search space; PSO updates the particle velocity Vel_id and position Pos_id through information interaction and cooperation within the swarm to finally obtain the optimal solution, specifically expressed as follows:

Vel_id＝αVel_id+ζ₁r₁(P_id-Pos_id)+ζ₂r₂(P_gd-Pos_id)；

Pos_id＝Pos_id+Vel_id；

α is called the inertia factor and reflects the habit of particle motion, α > 0, and global and local search behavior is balanced by adjusting the size of α; ζ₁ and ζ₂ are learning factors, also known as acceleration constants; r₁ and r₂ are uniform random numbers in the range [0, 1]; P_id is the d-th dimension of the individual extremum of the i-th particle, and P_gd is the d-th dimension of the global optimal solution; upper and lower bounds of the search space are set for the moving particles in the PSO.

5. The information granularity distribution method based on the fuzzy rule model as claimed in claim 1, wherein the information granularity distribution method based on the fuzzy rule model selects a non-uniform distribution strategy as a basis for the granularity distribution of the clustering centers before and after the optimization, different intervals have unequal lengths in different dimensions of the clustering centers, and satisfy the constraint of the overall granularity information balance:

wherein the two variables in the constraint are given by:

6. The fuzzy rule model-based information granularity allocation method of claim 5, wherein the method defines granularity rules:

wherein the two operator symbols in the rule denote interval calculations; the input is a data set x in n-dimensional space, and the i-th v_i is obtained by the first-order fuzzy model, i being the index of the rule; a hypercube cluster center V_i is constructed using the non-uniform allocation strategy; the hypercube is projected onto the corresponding coordinate axis e_i, and in the same j-th dimension the farthest and closest distances from the projection of the vector to the projected interval of the hypercube are calculated; the calculation formulas are as follows:

wherein m > 1:

the complete calculation of the output interval is as follows:

7. The granular model evaluation method aimed at output rationality according to claim 4, characterized in that the quality of the information granules is evaluated with the indices Coverage and Specificity, so as to realize the optimization and modeling of the granularity rules;

coverage: used to judge the accuracy of the output of the granular model, that is, whether the target information is contained, and expressing the extent to which the information granule Y_k generated by the model covers y_k:

specificity: this index describes the level of detail of the information granule Y_k; specificity is a decreasing function of the interval length of Y_k and increases as the interval length decreases; when the allocated information granularity is 0, Y_k degenerates into a point, at which the specificity reaches its maximum; it is defined as:

wherein

is the output interval determined by the decision; both indices are constrained and are represented by cov() and spe(); the evaluation function for granularity optimization is defined as:

Φ()＝cov()·spe()；

where s is the number of set points.

8. The optimal information granularity distribution method based on the fuzzy rule model as claimed in claim 4, wherein the further optimization of the information granularity distribution is realized by using the product of specificity and coverage as a fitness function in the optimization process, the optimal distribution of the information granularity distribution based on the fuzzy rule of the model needs to obtain uncertainty values of all the information granularity, and the particle swarm algorithm PSO is selected to realize the construction of the optimal granularity rule.

9. Use of the rule-based optimized data-driven granular modeling method according to any one of claims 1 to 8 in data mining and knowledge discovery.