WO2024045005A1

WO2024045005A1 - Data classification method based on dynamic bayesian network classifier

Info

Publication number: WO2024045005A1
Application number: PCT/CN2022/116055
Authority: WO
Inventors: 周亮; 吴韬; 张斯雯; 孔平; 王双成
Original assignee: 上海健康医学院
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2024-03-07

Abstract

The present invention belongs to the technical field of data classification. Provided is a data classification method based on a dynamic Bayesian network classifier. The data classification method comprises: acquiring a time series sample data set; constructing a Bayesian network classifier according to the time series sample data set, and learning the structure and the weight coefficient of the Bayesian network classifier, so as to determine an optimal classifier; and on the basis of the optimal classifier, determining a class variable corresponding to each attribute variable to be classified in time series data to be classified. Therefore, time series data can be accurately classified.

Description

A data classification method based on dynamic Bayesian network classifier

Technical field

The invention relates to the field of data classification, and in particular to a data classification method based on a dynamic Bayesian network classifier.

Background technique

Changes in the classes and attributes of time series data are not synchronous. The dynamic Bayesian network is an extension of the traditional Bayesian network and is suited to time-related uncertainty problems, for example stock trend prediction in economics and disease diagnosis and prediction in medicine. However, the directed edges in its structure mainly express causal relationships rather than channels or paths of information transmission, so the dynamic Bayesian network is better suited to dynamic analysis and inference than to direct classification.

Contents of the invention

The purpose of the present invention is to provide a data classification method based on a dynamic Bayesian network classifier, which can accurately classify time series data.

In order to achieve the above objects, the present invention provides the following solutions:

A data classification method based on dynamic Bayesian network classifier, including:

Obtain the time series sample data set; the time series sample data set includes sample attribute variables at multiple historical time points, the actual class variable corresponding to each sample attribute variable, and the transitive dependency information, direct derived dependency information and indirect derived dependency information of each sample attribute variable;

Build a Bayesian network classifier based on the time series sample data set, learn the Bayesian network classifier structure and weight coefficients, and determine the optimal classifier;

Obtain time series data to be classified;

Based on the optimal classifier, determine the class variables corresponding to each attribute variable to be classified in the time series data to be classified.

According to specific embodiments provided by the present invention, the present invention discloses the following technical effects: building a Bayesian network classifier based on time series sample data sets can accurately classify time series data.

Description of drawings

Figure 1 is a flow chart of the data classification method based on the dynamic Bayesian network classifier;

Figure 2 shows the local structure of the classifier;

Figure 3 is a schematic diagram of the evolution model of the classifier.

Detailed description

In order to make the above objects, features and advantages of the present invention more obvious and understandable, the present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Figure 1, the data classification method based on a dynamic Bayesian network classifier includes:

S1: Obtain the time series sample data set. The time series sample data set includes sample attribute variables at multiple historical time points, the actual class variable corresponding to each sample attribute variable, and the transitive dependency information, direct derived dependency information and indirect derived dependency information of each sample attribute variable.

S2: Construct a Bayesian network classifier based on the time series sample data set, learn the structure and weight coefficients of the Bayesian network classifier, and determine the optimal classifier.

S3: Obtain the time series data to be classified. The time series data to be classified includes attribute variables to be classified at multiple current time points, and the transitive dependency information, direct derived dependency information and indirect derived dependency information of each attribute variable to be classified.

S4: Based on the optimal classifier, determine the class variables corresponding to each attribute variable to be classified in the time series data to be classified.

Non-time series data sets are converted into time series data sets. X₁[t], X₂[t], ..., X_n[t] and C[t] denote the time series attribute variables and the class variable respectively, where t is a discrete time point with 1≤t≤T, and x₁[t], x₂[t], ..., x_n[t], c[t] are their specific values. D[n,T]={x₁[t],x₂[t],...,x_n[t],c[t] | 1≤t≤T} is a time series classification data set with T records, and there are temporal dependencies between the records in D[n,T]. First, a classifier is established based on the given time series data set D[n,T]; the established classifier is then used to make predictions […], where […] is the order of the classifier. This embodiment studies time classifiers for […].

S1 includes: obtaining the first time series data set. The first time series data set includes sample attribute variables at multiple historical time points, actual class variables corresponding to each sample attribute variable, transitive dependency information, direct derived dependency information and indirect derived dependency information. Based on the Markov hypothesis, time series transformation is performed on the first time series data set to obtain the second time series data set. Based on the misalignment transformation of the dynamic Bayesian network classifier order, the misalignment correspondence between the sample attribute variables and the class variables in the second time series data set is established, and the time series sample data set is obtained.

The converted time series sample data set is D[n,T]={x₁[t-1],x₁[t],x₂[t-1],x₂[t],...,x_n[t-1],x_n[t],c[t],c[t+1] | 2≤t≤T}.
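As an illustration only, the time-delay and dislocation transformation could be sketched in Python as follows. The function name `lag_and_shift`, the record layout and the dictionary keys are assumptions made for this sketch and do not appear in the source. Because c[T+1] is not observed in a training set of T records, the sketch stops at t = T-1 rather than t = T.

```python
from typing import Dict, List, Sequence, Tuple

# One record per time point: (attribute values x[t], class label c[t]).
Record = Tuple[Tuple[float, ...], int]


def lag_and_shift(records: Sequence[Record]) -> List[Dict[str, object]]:
    """Build rows {x[t-1], x[t], c[t], c[t+1]} from a series of T records.

    records[k] holds time point t = k + 1 (1-indexed time, as in the text).
    """
    total = len(records)
    rows: List[Dict[str, object]] = []
    # t needs a predecessor (t-1) and a successor (t+1), so t = 2 .. T-1.
    for t in range(2, total):
        x_prev, _ = records[t - 2]        # x[t-1]
        x_curr, c_curr = records[t - 1]   # x[t], c[t]
        _, c_next = records[t]            # c[t+1]
        rows.append({
            "t": t,
            "x_prev": x_prev,
            "x_curr": x_curr,
            "c_curr": c_curr,
            "c_next": c_next,
        })
    return rows
```

Each row pairs the attributes at t-1 and t with the class at t and the class at t+1, which is the misaligned correspondence between attributes and classes described above.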

S2 includes the following steps.

First, the maximum likelihood estimation method is used to determine the initial attribute tree from the time series sample data set. The initial attribute tree includes each sample attribute variable and the first prediction class variable corresponding to each sample attribute variable. Specifically, for any sample attribute variable, the classification accuracy of the actual class variable corresponding to the sample attribute variable and the target sample attribute variable is calculated; the target sample attribute variable is any other sample attribute variable in the time series sample data set. The target sample attribute variable with the maximum classification accuracy is used as the first prediction class variable of the sample attribute variable. From each sample attribute variable and its first prediction class variable, the forward greedy search method is used to learn the attribute tree, giving the initial attribute tree. The first classification accuracy is determined from the first prediction class variable and the actual class variable corresponding to each sample attribute variable.

Next, sample attribute variables at any number of consecutive time points are selected from the time series sample data set to obtain the time series segment data set. Using the greedy random search method with the time series segment data set and the first classification accuracy, the maximum likelihood estimation method is applied to optimize the initial attribute tree, giving the optimal attribute tree. The optimal attribute tree includes each sample attribute variable and the second prediction class variable corresponding to each sample attribute variable. The optimal attribute tree is the structure of the Bayesian network classifier. The second classification accuracy is determined from the second prediction class variable and the actual class variable corresponding to each sample attribute variable.

Finally, based on the first classification accuracy, the second classification accuracy, the initial attribute tree and the optimal attribute tree, the weight coefficient of the Bayesian network classifier is determined to obtain the optimal classifier.

Use the following formula to determine the first classification accuracy:

where accuracy(D[n,T],T₀) is the first classification accuracy, D[n,T] is the time series sample data set, n is the number of sample attribute variables, T is the total period (the number of time points), T₀ is the test threshold, c_prediction[t] is the first prediction class variable of the sample attribute variable at time point t, and c_true[t] is the true class variable of the sample attribute variable at time point t.
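The accuracy formula itself is not reproduced in this text. The Python sketch below shows one plausible reading, stated as an assumption: accuracy is the fraction of time points at which the predicted class equals the true class, with the classifier retrained on a growing prefix of the series. The names `progressive_accuracy`, `predict_next` and `t0`, and the interpretation of the threshold as the size of the first training prefix, are hypothetical and not taken from the source.

```python
from typing import Callable, Hashable, Sequence


def progressive_accuracy(
    labels: Sequence[Hashable],
    predict_next: Callable[[Sequence[Hashable]], Hashable],
    t0: int,
) -> float:
    """Fraction of correct one-step-ahead predictions on a growing prefix.

    labels       -- true class labels c_true[1..T], stored 0-indexed
    predict_next -- hypothetical model: given the history, predict the next label
    t0           -- assumed meaning: size of the first training prefix
    """
    total = len(labels)
    if not 0 < t0 < total:
        raise ValueError("t0 must satisfy 0 < t0 < len(labels)")
    hits = 0
    for k in range(t0, total):
        # Train on the first k labels, then compare with the true label at k.
        if predict_next(labels[:k]) == labels[k]:
            hits += 1
    return hits / (total - t0)
```

For example, with a persistence predictor `lambda history: history[-1]`, the labels `[0, 0, 1, 1, 1, 0]` and `t0 = 2`, two of the four predictions are correct and the function returns 0.5.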

Estimate the probability distribution of the initial attribute tree and the probability distribution of the optimal attribute tree. According to the probability distribution of the initial attribute tree, the probability distribution of the optimal attribute tree, the first classification accuracy and the second classification accuracy, determine the weight coefficient of the Bayesian network classifier:

where p_new is the weight coefficient of the Bayesian network classifier, α is the first classification accuracy, β is the second classification accuracy, p_before is the probability distribution of the initial attribute tree, and p_after is the probability distribution of the optimal attribute tree.
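The weighting formula is not reproduced in this text. As a hedged illustration, the sketch below assumes an accuracy-weighted average of the two distributions, p_new = (α·p_before + β·p_after) / (α + β). This specific form is an assumption that matches the symbols defined above; it is not confirmed by the source, and the function name `average_models` is hypothetical.

```python
from typing import Dict, Hashable


def average_models(
    p_before: Dict[Hashable, float],
    p_after: Dict[Hashable, float],
    alpha: float,
    beta: float,
) -> Dict[Hashable, float]:
    """Accuracy-weighted average of two class distributions (assumed form)."""
    if alpha < 0 or beta < 0 or alpha + beta <= 0:
        raise ValueError("alpha and beta must be non-negative and not both zero")
    w_before = alpha / (alpha + beta)
    w_after = beta / (alpha + beta)
    classes = set(p_before) | set(p_after)
    # A class missing from one distribution is treated as probability 0 there.
    return {
        c: w_before * p_before.get(c, 0.0) + w_after * p_after.get(c, 0.0)
        for c in classes
    }
```

Under this assumption the result stays normalized whenever both inputs are normalized, and the classifier with the higher accuracy contributes more to the averaged distribution.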

The class node in the classifier is the parent node of all attribute nodes, which allows the classifier to make full use of transitive dependency information. The tree or forest structure among the attributes, together with density estimation based on Gaussian functions, effectively exploits direct and indirect derived dependency information while avoiding overfitting of the data. The time-delay transformation of variables integrates time-delay and non-time-delay information, and the dislocation transformation enables asynchronous classification and prediction. Evolutionary learning and classification enable the optimal classifier to continuously accumulate classification information and improve its classification ability.

The structure and expression of the classifier: under the Markov property assumption, variables with other time delays are conditionally independent given X₁[t-1], X₂[t-1], ..., X_n[t-1], C[t], X₁[t], X₂[t], ..., X_n[t], C[t+1]. According to Bayesian network theory and the conditional independence relationships in Figure 2, we get:

where γ is a quantity independent of c[t+1]; π_i[t-1] is the value of Π_i[t-1], the parent node of X_i[t-1] among X₁[t-1], ..., X_{i-1}[t-1]; π_i[t-1,t] is the value of Π_i[t-1,t], the parent node of X_i[t] among X₁[t-1], ..., X_n[t-1], X₁[t], ..., X_{i-1}[t]; f(·) is the density; and p(c[t+1] | c[t], x₁[t-1], ..., x_n[t-1], x₁[t], ..., x_n[t]) is the probability that the sample x₁[t-1], ..., x_n[t-1], x₁[t], ..., x_n[t] belongs to class c[t+1].

The probability is estimated based on the maximum likelihood method, and the attribute density is estimated using the Gaussian function.
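A minimal sketch of Gaussian density estimation by maximum likelihood, assuming a single continuous attribute observed within one class and ignoring any attribute parent. The function names and the `min_var` floor are assumptions added for this sketch to keep the density finite on constant data; they are not part of the source.

```python
import math
from typing import Sequence, Tuple


def fit_gaussian(values: Sequence[float], min_var: float = 1e-9) -> Tuple[float, float]:
    """Maximum likelihood estimates (mean, variance) of a Gaussian."""
    if not values:
        raise ValueError("at least one value is required")
    n = len(values)
    mean = sum(values) / n
    # The maximum likelihood variance divides by n, not n - 1.
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, max(var, min_var)


def gaussian_density(x: float, mean: float, var: float) -> float:
    """Density f(x) of a Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
```

In a classifier of the kind described here, one such density would be fitted per attribute and per class (and, where an attribute has a parent, conditioned on the parent as well), and the fitted densities would be multiplied with the class probabilities to score each candidate class.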

For the time series data set D[n,T], the threshold T₀ is determined based on the size of the time series, the validity of the class probabilities, the attribute density estimation, or actual needs. accuracy(fmdbn,D[n,T],T₀) is the classification accuracy of the classifier; c_prediction[TT₀+1] is the classification result for c[TT₀+1] obtained using D[n,TT₀] as the training set, and c_true[TT₀+1] is the true result.

The learning of the classifier is divided into initial learning and evolutionary learning, and each stage includes structure learning and parameter learning of ordered variables. Structure learning is at the core, and parameters can be estimated from the classifier structure and the input data set. Structural learning focuses on the construction and adjustment of attribute trees or forests.

(1) Initial learning: Initialize the attribute tree, combine the temporal progressive classification accuracy standard, attribute order and forward greedy search method to perform attribute tree learning to obtain a locally optimal attribute tree. For a given attribute order, parent nodes can only be searched among previous attributes, and each attribute has at most one parent node, thus forming a locally optimal attribute tree or forest.

(2) Evolutionary learning: The attribute tree obtained through initial learning needs to be continuously adjusted. After each adjustment, the new classifier is used as the basic classifier for the next round of adjustment. A greedy random search process is used to avoid convergence to a local optimum. For any node, a random integer between 0 and 2n is generated, and the node corresponding to the random number is used as the initial parent node b_j of the node. The classification accuracy is calculated based on b_j and the actual class variable of the node, and the node with the largest classification accuracy is used as the parent node of the attribute variable.

(3) Evolutionary classification calculation: Classification calculation is performed based on the new classifier obtained by averaging the classifier models before and after adjustment.
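The initial learning in step (1) can be sketched as a forward greedy search over a fixed attribute order. This is a simplified illustration: the function `forward_greedy_tree` and the `score` callback (standing in for the temporal progressive classification accuracy) are hypothetical names, and the sketch accepts a parent only when it strictly improves the score.

```python
from typing import Callable, Dict, Optional, Sequence

Parents = Dict[str, Optional[str]]


def forward_greedy_tree(
    order: Sequence[str],
    score: Callable[[Parents], float],
) -> Parents:
    """Give each attribute at most one parent chosen among earlier attributes."""
    parents: Parents = {attr: None for attr in order}
    best = score(parents)
    for i, attr in enumerate(order):
        # Only attributes earlier in the order may become the parent,
        # so the result is always a tree or a forest.
        for candidate in order[:i]:
            trial = dict(parents)
            trial[attr] = candidate
            trial_score = score(trial)
            if trial_score > best:
                best = trial_score
                parents[attr] = candidate
    return parents
```

Because every decision is kept once made, the result is only locally optimal, which is why the text follows initial learning with an evolutionary adjustment stage.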

FMDBN_before and FMDBN_after denote the classifiers obtained from {before_a_h | 1≤h≤2n} before adjustment and {after_a_h | 1≤h≤2n} after adjustment, respectively. Model averaging is performed on FMDBN_before and FMDBN_after to obtain a new classifier FMDBN_new, as shown in Figure 3. The classification information of the classifier is continuously accumulated and compressed through iteration to improve the classification ability of the classifier.
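One round of the greedy random adjustment in step (2), which produces the "after" structure that is then averaged with the "before" structure, might look like the sketch below. It is an interpretation under stated assumptions: the integer 0 is read as "no parent", integers 1 to 2n index the nodes, a change is kept only if the hypothetical `score` callback improves, and a cycle check (not mentioned in the source) keeps the structure a forest.

```python
import random
from typing import Callable, Dict, Optional, Sequence

Parents = Dict[str, Optional[str]]


def creates_cycle(parents: Parents, node: str, candidate: Optional[str]) -> bool:
    """True if making `candidate` the parent of `node` would close a cycle."""
    current = candidate
    while current is not None:
        if current == node:
            return True
        current = parents[current]
    return False


def random_greedy_step(
    nodes: Sequence[str],
    parents: Parents,
    score: Callable[[Parents], float],
    rng: random.Random,
) -> Parents:
    """Propose one random parent per node and keep only improving changes."""
    current = dict(parents)  # leave the caller's structure untouched
    best = score(current)
    for node in nodes:
        k = rng.randint(0, len(nodes))  # inclusive bounds; 0 means no parent
        candidate = None if k == 0 else nodes[k - 1]
        if creates_cycle(current, node, candidate):
            continue
        trial = dict(current)
        trial[node] = candidate
        trial_score = score(trial)
        if trial_score > best:
            current, best = trial, trial_score
    return current
```

Repeating this step and using each result as the starting point for the next round mirrors the description of the new classifier serving as the basic classifier for the next adjustment.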

This description uses specific examples to illustrate the principles and implementations of the present invention. The content of this description should not be understood as limiting the present invention.

Claims

A data classification method based on a dynamic Bayesian network classifier, characterized in that the data classification method includes:

Obtain the time series sample data set; the time series sample data set includes sample attribute variables at multiple historical time points, the actual class variable corresponding to each sample attribute variable, and the transitive dependency information, direct derived dependency information and indirect derived dependency information of each sample attribute variable;

Build a Bayesian network classifier based on the time series sample data set, learn the Bayesian network classifier structure and weight coefficients, and determine the optimal classifier;

Obtain time series data to be classified;

Based on the optimal classifier, determine the class variables corresponding to each attribute variable to be classified in the time series data to be classified.
The data classification method based on dynamic Bayesian network classifier according to claim 1, characterized in that said obtaining a time series sample data set specifically includes:

Obtain the first time series data set; the first time series data set includes sample attribute variables at multiple historical time points, the actual class variable corresponding to each sample attribute variable, and the transitive dependency information, direct derived dependency information and indirect derived dependency information of each sample attribute variable;

Based on the Markov hypothesis, perform time series transformation on the first time series data set to obtain the second time series data set;

Based on the misalignment transformation of the dynamic Bayesian network classifier order, the misalignment correspondence between the sample attribute variables and the class variables in the second time series data set is established to obtain the time series sample data set.
The data classification method based on a dynamic Bayesian network classifier according to claim 1, characterized in that constructing the Bayesian network classifier according to the time series sample data set, and learning the structure and weight coefficients of the Bayesian network classifier to determine the optimal classifier, specifically includes:

According to the time series sample data set, the maximum likelihood estimation method is used to determine the initial attribute tree; the initial attribute tree includes each sample attribute variable and the first prediction class variable corresponding to each sample attribute variable;

Determine the first classification accuracy based on the first predicted class variables and true class variables corresponding to each sample attribute variable;

Select sample attribute variables at any number of consecutive time points from the time series sample data set to obtain a time series segment data set;

Using the greedy random search method with the time series segment data set and the first classification accuracy, apply the maximum likelihood estimation method to optimize the initial attribute tree and obtain the optimal attribute tree; the optimal attribute tree includes each sample attribute variable and the second prediction class variable corresponding to each sample attribute variable; the optimal attribute tree is the structure of the Bayesian network classifier;

Determine the second classification accuracy based on the second predicted class variables and actual class variables corresponding to each attribute variable;

Based on the first classification accuracy, the second classification accuracy, the initial attribute tree and the optimal attribute tree, determine the weight coefficient of the Bayesian network classifier to obtain the optimal classifier.
The data classification method based on dynamic Bayesian network classifier according to claim 3, characterized in that the maximum likelihood estimation method is used to determine the initial attribute tree according to the time series sample data set, specifically including:

For any sample attribute variable, calculate the classification accuracy of the actual class variable corresponding to the sample attribute variable and the target sample attribute variable; the target sample attribute variable is any other sample attribute variable in the time series sample data set;

Use the target sample attribute variable corresponding to the maximum classification accuracy as the first prediction class variable of the sample attribute variable;

According to each sample attribute variable and the corresponding first prediction class variable, the forward greedy search method is used to perform attribute tree learning to obtain the initial attribute tree.
The data classification method based on dynamic Bayesian network classifier according to claim 3, characterized in that the following formula is used to determine the first classification accuracy:

where accuracy(D[n,T],T₀) is the first classification accuracy, D[n,T] is the time series sample data set, n is the number of sample attribute variables, T is the total period (the number of time points), T₀ is the test threshold, c_prediction[t] is the first prediction class variable of the sample attribute variable at time point t, and c_true[t] is the true class variable of the sample attribute variable at time point t.
The data classification method based on a dynamic Bayesian network classifier according to claim 3, characterized in that determining the weight coefficient of the Bayesian network classifier based on the first classification accuracy, the second classification accuracy, the initial attribute tree and the optimal attribute tree, to obtain the optimal classifier, specifically includes:

Estimate the probability distribution of the initial attribute tree and the probability distribution of the optimal attribute tree;

According to the probability distribution of the initial attribute tree, the probability distribution of the optimal attribute tree, the first classification accuracy and the second classification accuracy, the weight coefficient of the Bayesian network classifier is determined.
The data classification method based on a dynamic Bayesian network classifier according to claim 6, characterized in that the following formula is used to determine the weight coefficient of the Bayesian network classifier:

where p_new is the weight coefficient of the Bayesian network classifier, α is the first classification accuracy, β is the second classification accuracy, p_before is the probability distribution of the initial attribute tree, and p_after is the probability distribution of the optimal attribute tree.