CN113869382A - Semi-supervised learning epilepsia electroencephalogram signal identification method based on domain embedding probability

Info

Publication number: CN113869382A
Application number: CN202111084540.9A
Authority: CN
Inventors: 倪彤光; 顾晓清; 蒋亦樟; 薛婧; 钱鹏江
Original assignee: Changzhou University
Current assignee: Changzhou University
Priority date: 2021-09-16
Filing date: 2021-09-16
Publication date: 2021-12-31

Abstract

The invention relates to the technical field of electroencephalogram (EEG) signal identification, and in particular to a semi-supervised learning method for identifying epileptic EEG signals based on domain embedding probability. The method comprises the following steps: 1. collecting and preprocessing EEG signals; 2. constructing a labeled set X_l and an unlabeled set X_u; 3. forming a homogeneous sample pair set M and a heterogeneous sample pair set D; 4. constructing a semi-supervised incidence matrix on the data set X; 5. projecting each sample in X into a low-dimensional space; 6. constructing a domain embedding probability matrix; 7. updating the category labels corresponding to the k one-dimensional entropy feature vectors; 8. updating X_l, X_u, M and D; 9. judging whether the set X_u is empty; 10. classifying the EEG sample x_test under test to obtain an identification result. By using feature projection and domain embedding, the invention preserves the local structure of the data, gives the low-dimensional representation of the EEG signals higher distinguishability and discriminability, and can accurately classify and identify epileptic EEG signals.

Description

Semi-supervised learning epilepsia electroencephalogram signal identification method based on domain embedding probability

Technical Field

The invention relates to the technical field of electroencephalogram (EEG) signal identification, and in particular to a semi-supervised learning method for identifying epileptic EEG signals based on domain embedding probability.

Background

Epilepsy is a disorder of brain function. During a seizure a patient may experience temporary clouding of consciousness or uncontrollable convulsions, which cause great physical and mental harm to the patient and the patient's family. Electroencephalography can accurately record the various waveforms that occur during a seizure, so EEG analysis is an important basis for diagnosing epileptic seizures. EEG signals are random and non-stationary; clinicians can judge an electroencephalogram subjectively by drawing on prior knowledge, but this is error-prone and inefficient. Automatic identification and monitoring of epileptic EEG signals helps to improve the accuracy of manual diagnosis and to reduce workload. In the big data era, machine learning has become an important means of EEG analysis.

The first step of machine-learning-based epileptic EEG identification is signal acquisition. Non-invasive acquisition only requires attaching electrodes to the corresponding areas of the scalp; it is simple, convenient and harmless to the subject, and is therefore widely used. The second step is preprocessing. EEG signals collected from scalp electrodes are very weak and are often mixed with artifacts and noise, so an effective preprocessing method is needed to remove redundant information, reduce the dimension and extract the useful signal. Common preprocessing methods include electrode screening, removal of artifacts such as electrooculogram and electromyogram interference, and time-domain and spatial filtering. The third step is feature extraction. Current research on EEG feature extraction focuses mainly on time-domain analysis, frequency-domain analysis, joint time-frequency analysis, spatial filtering and nonlinear dynamics analysis. Once effective features have been extracted, they must be classified to realize automatic epilepsy detection. The classification algorithm is therefore a key link in the design of an epilepsy recognition task.

Researchers have applied a variety of methods to this problem. Zhouyou proposed an EEG detection method and device that uses a wavelet neural network: the extracted feature vectors are fed into a classifier obtained from the wavelet neural network, which marks abnormal EEG signals. Gong Guang hong et al. proposed a method for automatically identifying multi-stage epileptic EEG signals based on supervised gradient boosting, in which a gradient boosting classifier detects the epileptic signals. Gac bin et al. proposed epileptic seizure detection equipment and an early warning system based on multi-data acquisition, which use the extracted characteristic parameters as feature vectors to train the decision trees of a random forest classifier and so form a random forest model. Meizhen et al. proposed an EEG signal processing method and an epilepsy detection system that preprocess the EEG signals, remove frequency bands, extract time-domain and entropy-based features, and finally select an optimal feature subset with an improved correlation-based feature selection method.

However, these methods are all traditional supervised classification methods. They need a large number of labeled EEG samples to train a classifier that performs well, and acquiring such samples is costly in time, labor and money. Automatic epilepsy detection with only a small number of labeled samples therefore has great research significance and practical value. In addition, EEG features generally have high dimensionality; high-dimensional data are harder to analyze than low-dimensional data and may contain useless and redundant feature information. These characteristics pose a major challenge to the practical processing of epileptic EEG signals.

Disclosure of Invention

The technical problem to be solved by the invention is as follows: on the basis of improving the effect of the existing image segmentation method, in view of the characteristic difference of electroencephalogram signals under different health states, under the scene of small samples and insufficient marked samples, the characteristic difference is amplified, expressed in a certain digital form and recognized by a classifier, and the automatic epilepsia electroencephalogram signal recognition method with less time consumption, strong applicability and high classification accuracy is formed. The method relates to a characteristic dimensionality reduction and classification method used in the automatic detection process of the epilepsia electroencephalogram signals, effectively ensures the integrity of local information of the electroencephalogram signals, enlarges the difference of the electroencephalogram information in different states, and improves the classification precision of the epilepsia electroencephalogram signals by using a proper semi-supervised learning model.

The technical scheme adopted by the invention is as follows: the method for recognizing the epilepsia electroencephalogram signals based on the semi-supervised learning of the domain embedding probability comprises the following steps:

Step 1: collecting original EEG signals of different categories and preprocessing them;

Step 2: performing feature extraction on the preprocessed EEG signals to obtain a feature data set X = {x₁, x₂, ..., x_n} containing n training samples, where x_i is the i-th feature vector in X, x_i ∈ R^d, and d denotes the dimension of the sample. The first l samples {x₁, x₂, ..., x_l} carry category labels of the EEG signals and are denoted X_l; the corresponding category label matrix is denoted Y_l = {y₁, y₂, ..., y_l}, a matrix of l rows and c columns in which the label vector y_i indicates the class j to which sample x_i belongs, and c is the number of EEG signal classes. The last (n-l) samples {x_{l+1}, x_{l+2}, ..., x_n} of the sample set X carry no category labels and are denoted X_u; the corresponding category label matrix is denoted Y_u = {y_{l+1}, y_{l+2}, ..., y_n}, a zero matrix of (n-l) rows and c columns;
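The label matrices of Step 2 can be illustrated with a minimal sketch. It assumes that each label vector is one-hot (a 1 in the column of the sample's class, 0 elsewhere), which is a common convention but is not confirmed by the text; the function names are hypothetical.

```python
def one_hot(label, num_classes):
    """Return a label vector with 1.0 at index `label` and 0.0 elsewhere (assumed encoding)."""
    return [1.0 if j == label else 0.0 for j in range(num_classes)]


def build_label_matrices(labels, num_unlabeled, num_classes):
    """Build Y_l (l rows, c columns) and the all-zero Y_u ((n-l) rows, c columns)."""
    Y_l = [one_hot(y, num_classes) for y in labels]
    Y_u = [[0.0] * num_classes for _ in range(num_unlabeled)]
    return Y_l, Y_u


# Three labeled samples (classes 0, 2, 1), two unlabeled samples, c = 3.
Y_l, Y_u = build_label_matrices([0, 2, 1], num_unlabeled=2, num_classes=3)
```

Here Y_l has one row per labeled sample and Y_u starts as zeros, to be filled in by the later label update.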

Step 3: pairing the samples in the set X_l according to the category label matrix Y_l to form a homogeneous sample pair set M and a heterogeneous sample pair set D, where M = {(x_i, x_j) | y_i = y_j} and D = {(x_i, x_j) | y_i ≠ y_j};
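A minimal sketch of the pairing in Step 3 follows. It assumes that pairs are unordered (i < j) and stored as index pairs rather than as the feature vectors themselves; `build_pair_sets` is a hypothetical name.

```python
from itertools import combinations


def build_pair_sets(labels):
    """Split all index pairs (i, j), i < j, of the labeled set into
    same-class pairs M and different-class pairs D."""
    M, D = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            M.append((i, j))
        else:
            D.append((i, j))
    return M, D


# Four labeled samples with classes 0, 0, 1, 2.
M, D = build_pair_sets([0, 0, 1, 2])
```

With l labeled samples there are l(l-1)/2 pairs in total, so |M| + |D| grows quadratically as labels are added in later iterations.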

Step 4: constructing a semi-supervised incidence matrix U on the data set X, where the element U_ij in the i-th row and j-th column of U is defined as:

where 1 ≤ i ≤ n, 1 ≤ j ≤ n, a is a positive number greater than 1, and b is a positive number greater than 1 and less than a;

Step 5: denoting the projection matrix as A ∈ R^{d×e}, 0 < e ≤ d, each sample x_i in the data set X is projected by A into the low-dimensional space R^e, and its low-dimensional feature is expressed as:

z_i = A^T x_i, (2)

The projection matrix A is calculated as:

where I denotes the identity matrix, |M| and |D| denote the number of sample pairs in the sets M and D respectively, Tr{·} denotes the trace of a matrix, and ^T denotes the matrix transpose. Setting

and introducing a Lagrange coefficient α, equation (3) is solved by the Lagrange multiplier method, giving:

For the matrix

eigenvalue decomposition is performed, and the matrix A is updated with the eigenvectors corresponding to the e largest eigenvalues;
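Once A is known, equation (2) is a plain matrix-vector product. The sketch below shows only that projection; it does not reproduce how A is obtained, and the matrix values are made up for illustration. `project` is a hypothetical helper.

```python
def project(A, x):
    """Compute z = A^T x, where A is a d-by-e matrix given as d rows of
    length e and x is a feature vector of length d."""
    d, e = len(A), len(A[0])
    if len(x) != d:
        raise ValueError("x must have one entry per row of A")
    return [sum(A[r][c] * x[r] for r in range(d)) for c in range(e)]


# Illustrative values only: d = 3 input features, e = 2 output dimensions.
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
z = project(A, [2.0, 3.0, 4.0])
```

Each column of A plays the role of one retained eigenvector, so z has e components regardless of the original dimension d.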

Step 6: constructing a domain embedding probability matrix S in the low-dimensional space, where the element S_ij in the i-th row and j-th column of S denotes the probability that z_i selects z_j as a neighbor; S_ij is calculated as:

where dis(z_i, z_j) denotes the Euclidean distance from z_i to z_j, and S_ij follows a Gaussian distribution centered at z_i whose covariance matrix is the identity matrix;
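The exact formula for S_ij is not given in this text. As a rough sketch, the description (a Gaussian centered at z_i with identity covariance, read as a neighbor-selection probability) is consistent with the row-normalized form used in stochastic neighbor embedding, S_ij = exp(-dis(z_i, z_j)² / 2) / Σ_{k≠i} exp(-dis(z_i, z_k)² / 2) with S_ii = 0. That form is an assumption here, not the patent's stated equation.

```python
import math


def neighbor_probabilities(Z):
    """Row-normalized Gaussian neighbor probabilities (assumed SNE-style form).

    Z is a list of low-dimensional points; S[i][j] is the assumed probability
    that point i picks point j as its neighbor, with S[i][i] = 0.
    """
    n = len(Z)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        weights = [
            0.0 if j == i else math.exp(-0.5 * math.dist(Z[i], Z[j]) ** 2)
            for j in range(n)
        ]
        total = sum(weights)
        if total > 0.0:
            S[i] = [w / total for w in weights]
    return S


# Three points on a line: point 0 is closer to point 1 than to point 2.
S = neighbor_probabilities([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
```

Under this assumed form each row of S sums to 1 and nearer points receive higher probability, which is the behavior the step describes.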

Step 7: inputting the semi-supervised incidence matrix U, the projection matrix A and the domain embedding probability matrix S into the semi-supervised learning model of domain embedding probability, and updating the category labels corresponding to the k feature vectors with the smallest one-dimensional entropy;

Step 7.1: the objective function of the semi-supervised learning model of domain embedding probability is expressed as:

where ‖·‖₂ denotes the 2-norm and λ is a regularization parameter taking a positive real value. Let G_ij = S_ij + λU_ij and construct the matrix G with G_ij as its elements; let

and construct the diagonal matrix Q with Q_ii as its elements. The matrix Q is split into 4 blocks after its l-th row and l-th column:

Equation (6) can then be expressed as:

Setting the partial derivative of equation (7) with respect to Y_u equal to 0 gives the update expression for Y_u:

Step 7.2: calculating the one-dimensional entropy of each label vector in Y_u by the following formula:

where y_i ∈ Y_u and y_i,k denotes the k-th component of the label vector y_i. Equation (9) expresses the information content of the components of the label vector and the uncertainty of the category to which the sample belongs: the larger the entropy, the greater the uncertainty of the category; the smaller the entropy, the smaller the uncertainty;
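The entropy ranking of Step 7.2 can be sketched as follows. Equation (9) is not reproduced in this text, so the sketch assumes the usual Shannon form H(y_i) = -Σ_k y_i,k log y_i,k over non-negative components; the function names and the sample values are hypothetical.

```python
import math


def label_entropy(y):
    """Shannon entropy of a soft label vector (assumed form of the
    one-dimensional entropy); zero components contribute nothing."""
    return -sum(p * math.log(p) for p in y if p > 0.0)


def k_lowest_entropy(Y_u, k):
    """Return the indices of the k label vectors with the smallest entropy,
    i.e. the k samples whose class is least uncertain."""
    order = sorted(range(len(Y_u)), key=lambda i: label_entropy(Y_u[i]))
    return order[:k]


# Illustrative soft labels for three unlabeled samples, c = 3.
Y_u = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.30, 0.30],  # less confident
    [0.34, 0.33, 0.33],  # nearly uniform
]
picked = k_lowest_entropy(Y_u, 2)
```

A nearly uniform label vector has entropy close to log(c), while a vector concentrated on one class has entropy close to 0, so low entropy marks the samples that are safest to label first.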

Step 7.3: taking the feature vectors corresponding to the k smallest one-dimensional entropies and updating the class labels of the corresponding samples according to their maximum component, i.e. if the j-th component of y_i is the largest, then set

Step 8: according to the class labels of the k feature vectors with the smallest one-dimensional entropy, adding the corresponding samples to the set X_l, deleting them from the set X_u, and reconstructing the sets M and D;

Step 9: judging whether the set X_u is empty; if not, go to Step 4; if so, go to Step 10;
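The control flow of Steps 7.3 to 9 amounts to a self-training loop. The sketch below shows only that bookkeeping. The callable `score_unlabeled` is a hypothetical stand-in for Steps 4 to 7.2 (it must return, for each unlabeled index, an entropy and a soft label vector), and the toy scorer exists purely to make the example runnable.

```python
def self_training(labeled, unlabeled, score_unlabeled, k):
    """Move the k lowest-entropy samples from the unlabeled set to the
    labeled set on every pass, until no unlabeled sample remains.

    labeled:   dict mapping sample index -> class index
    unlabeled: iterable of sample indices
    score_unlabeled(labeled, unlabeled) -> {index: (entropy, label_vector)}
    """
    labeled = dict(labeled)
    unlabeled = set(unlabeled)
    while unlabeled:                                   # Step 9: stop when X_u is empty
        scores = score_unlabeled(labeled, unlabeled)   # stand-in for Steps 4 to 7.2
        chosen = sorted(unlabeled, key=lambda i: scores[i][0])[:k]
        for i in chosen:
            vec = scores[i][1]
            # Step 7.3: the class is the position of the largest component.
            labeled[i] = max(range(len(vec)), key=vec.__getitem__)
            unlabeled.remove(i)                        # Step 8: move from X_u to X_l
    return labeled


def toy_scorer(labeled, unlabeled):
    """Made-up scores: entropy equals the index, odd indices lean to class 1."""
    return {i: (float(i), [0.2, 0.8] if i % 2 else [0.7, 0.3]) for i in unlabeled}


result = self_training({0: 0}, {1, 2, 3}, toy_scorer, k=2)
```

Because at least one sample leaves the unlabeled set on every pass (for k ≥ 1), the loop terminates after at most ⌈(n-l)/k⌉ iterations.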

Step 10: an EEG sample x_test to be tested is collected and preprocessed, and its low-dimensional feature representation z_test is calculated according to equation (2). The distance dis(z_i, z_test) between z_test and the low-dimensional feature representation z_i of each sample in the data set X_l is calculated, the r feature vectors with the smallest dis(z_i, z_test) are selected, and by voting the category that occurs most often among these r feature vectors is taken as the category of x_test. Finally, whether x_test is an epileptic EEG signal is determined from its category.
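Step 10 is a nearest-neighbor vote in the low-dimensional space, which can be sketched as below. The points, class names and the helper name `knn_classify` are illustrative assumptions; ties in the vote are broken by whichever class is encountered first among the neighbors, a choice the text does not specify.

```python
import math
from collections import Counter


def knn_classify(z_test, Z_l, labels, r):
    """Return the most frequent class among the r labeled points closest to
    z_test in Euclidean distance."""
    nearest = sorted(range(len(Z_l)), key=lambda i: math.dist(Z_l[i], z_test))[:r]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Illustrative low-dimensional representations of five labeled samples.
Z_l = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
labels = ["normal", "normal", "normal", "ictal", "ictal"]

pred_a = knn_classify([0.2, 0.1], Z_l, labels, r=3)
pred_b = knn_classify([4.9, 5.0], Z_l, labels, r=3)
```

An odd r reduces the chance of ties in a two-class vote, though ties remain possible with three or more classes.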

The invention has the following beneficial effects:

1. the method is based on a semi-supervised machine learning algorithm, only a small part of marked samples are needed for model training, and a large amount of manual marking work is reduced;

2. the invention learns the low-dimensional representation of the samples with feature projection and domain embedding, which preserves the local structure of the data while fusing the pairing relations among samples into the model, so that the low-dimensional representation has distinguishability and discriminability for epileptic EEG signals;

3. unlike the common practice of completing the dimensionality reduction task and the classification task in separate stages, the invention integrates the low-dimensional representation of the samples and the identification of sample labels into a single learning task;

4. through iterative optimization, the invention reduces the dependence of semi-supervised learning on labeled samples step by step, until the label information of all training samples has been identified.

Drawings

FIG. 1 is a flow chart of a domain embedding probability-based semi-supervised learning epileptic electroencephalogram signal identification method of the present invention;

FIG. 2 is a diagram of brain electrical signals in accordance with one embodiment of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings and examples, which are simplified schematic drawings and illustrate only the basic structure of the invention in a schematic manner, and therefore only show the structures relevant to the invention.

Fig. 1 is a flowchart of a domain embedding probability-based semi-supervised learning epileptic electroencephalogram signal identification method according to an embodiment of the invention, and the method comprises the following 10 steps.

Step 1: original EEG signals of different categories are acquired and preprocessed. The epilepsy EEG data set of the University of Bonn, Germany, is taken as an example of the original EEG signals, as shown in fig. 2. The data set is divided into five subsets, Set A to Set E. Each subset contains 100 samples of the same type, and each sample is an EEG time series of 4097 sampling points with a sampling frequency of 173.61 Hz and a duration of 23.6 s; artifacts are removed by filtering at 0.53-40 Hz. Set A and Set B are EEG signals collected from 5 healthy subjects with eyes open and eyes closed respectively; Set C and Set D are EEG signals collected from 5 epileptic patients during the interictal period, in the area contralateral to the focus and in the focal area respectively; Set E is EEG signals collected from the focal area during seizures. In this embodiment, Set A and Set B are treated as normal data, Set C and Set D as interictal data, and Set E as ictal data. 80% of the data set is used for training and the remaining 20% for testing. The acquired original EEG signals are preprocessed with EEGLAB, an open-source MATLAB toolbox, including down-sampling, filtering, re-referencing of electrodes, baseline correction and independent component analysis, so as to obtain EEG signals that are as free of noise as possible;
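The recording parameters quoted above are mutually consistent, as a short check shows: 4097 sampling points at 173.61 Hz last about 23.6 s, and five subsets of 100 samples give 500 samples in total. The variable names are illustrative.

```python
# Recording parameters quoted in the text.
n_points = 4097            # sampling points per sample
fs_hz = 173.61             # sampling frequency in Hz
n_subsets, per_subset = 5, 100

duration_s = n_points / fs_hz         # about 23.6 s per sample
n_samples = n_subsets * per_subset    # samples across Set A to Set E

print(f"{duration_s:.1f} s per sample, {n_samples} samples in total")
```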

Step 2: feature extraction is performed on the preprocessed EEG signals. In this embodiment, a 4-level discrete wavelet transform decomposition of the EEG data set is carried out with the dmey wavelet basis to obtain wavelet packet nodes for 16 frequency band spaces, and 3 types of features are extracted from the original signal below 27.13 Hz. The first type is time-domain features, consisting of descriptive statistics: mean, median, minimum, maximum, skewness, standard deviation, peak, first quartile, third quartile and interquartile range. The second type is entropy-based features: the sample entropy reflects signal complexity by measuring the probability that a new pattern arises in the signal, and the larger its value, the more complex the corresponding sample sequence. The third type is time-frequency-domain features: the band energy feature reflects the energy of the EEG signal in a time-frequency localized space, where each orthogonal wavelet packet space projection component of the original signal at each decomposition level represents the time-frequency localization information of the source signal in the corresponding time-frequency resolution space. The 3 types of features total 58 dimensions. Feature extraction yields a feature data set X = {x₁, x₂, ..., x_n} containing n training samples, where x_i is the i-th feature vector in X, x_i ∈ R^d, and d denotes the dimension of the sample. The first l samples {x₁, x₂, ..., x_l} carry category labels of the EEG signals and are denoted X_l; the corresponding category label matrix is denoted Y_l = {y₁, y₂, ..., y_l}, a matrix of l rows and c columns in which the label vector y_i indicates the class j to which sample x_i belongs, and c is the number of EEG signal classes. The last (n-l) samples {x_{l+1}, x_{l+2}, ..., x_n} of the sample set X carry no category labels and are denoted X_u; the corresponding category label matrix is denoted Y_u = {y_{l+1}, y_{l+2}, ..., y_n}, a zero matrix of (n-l) rows and c columns. In this embodiment n = 160 and l = 20;
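Two parts of this step lend themselves to a small illustration. First, one reading consistent with the quoted numbers is that the 16 level-4 wavelet packet nodes split the band from 0 to half the sampling frequency evenly, so the 27.13 Hz limit corresponds to the lowest 5 nodes; this reading is an assumption. Second, the descriptive statistics can be computed directly. The sketch below covers only part of the first feature type, uses population formulas for standard deviation and skewness and Python's default quartile method, and these are choices the text does not specify.

```python
import statistics

# Assumed even split of 0 .. fs/2 into 16 wavelet packet bands.
fs_hz = 173.61
band_width_hz = fs_hz / 2 / 16        # about 5.43 Hz per node
upper_limit_hz = 5 * band_width_hz    # about 27.13 Hz for the lowest 5 nodes


def descriptive_features(signal):
    """Compute a subset of the descriptive time-domain statistics for one signal."""
    q1, _, q3 = statistics.quantiles(signal, n=4)
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)
    if std > 0.0:
        skew = sum((v - mean) ** 3 for v in signal) / (len(signal) * std ** 3)
    else:
        skew = 0.0  # a constant signal has no asymmetry
    return {
        "mean": mean,
        "median": statistics.median(signal),
        "min": min(signal),
        "max": max(signal),
        "std": std,
        "skewness": skew,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


feats = descriptive_features([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
```

In practice these statistics would be computed per signal (or per retained band) and concatenated with the entropy and band-energy features to form the 58-dimensional vector x_i.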

In equation (1) of Step 4, 1 ≤ i ≤ n, 1 ≤ j ≤ n, a is a positive number greater than 1, and b is a positive number greater than 1 and less than a; in this embodiment a = 3 and b = 2;

Step 5: denoting the projection matrix as A ∈ R^{d×e}, with e = 20 in this embodiment, each sample x_i in the data set X is projected by A into the low-dimensional space R^e, and its low-dimensional feature is expressed as:

z_i = A^T x_i, (2)

The projection matrix A is calculated as described for Step 5 above;

Step 7: inputting the semi-supervised incidence matrix U, the projection matrix A and the domain embedding probability matrix S into the semi-supervised learning model of domain embedding probability, and determining the category labels corresponding to the k one-dimensional entropy feature vectors, where k = 7 in this embodiment;

Equation (6) can be expressed as:

Setting the partial derivative of equation (7) with respect to Y_u equal to 0 gives the update expression for Y_u:

where y_i ∈ Y_u and y_i,k denotes the k-th component of the label vector y_i. Equation (9) expresses the information content of the components of the label vector and the uncertainty of the category to which the sample belongs: the larger the entropy, the greater the uncertainty of the category; the smaller the entropy, the smaller the uncertainty;

Step 10: an EEG sample x_test to be tested is collected and preprocessed, and its low-dimensional feature representation z_test is calculated according to equation (2). The distance dis(z_i, z_test) between z_test and the low-dimensional feature representation z_i of each sample in the data set X_l is calculated, the r feature vectors with the smallest dis(z_i, z_test) are selected, and by voting the category that occurs most often among these r feature vectors is taken as the category of x_test. Finally, whether x_test is an epileptic EEG signal is determined from its category. In this embodiment r = 5. The indexes used to evaluate classification performance are classification accuracy, recall and F-Score, and Table 1 gives the result for each index. As the results in Table 1 show, the classifier achieves high precision on the 3 classes of EEG signals and can efficiently realize automatic identification of epileptic EEG signals.

Table 1. Statistical indexes of the present embodiment

Class	Accuracy	Recall	F-Score
Set A and Set B	94.56%	95.05%	94.88%
Set C and Set D	95.24%	95.75%	95.60%
Set E	96.49%	96.91%	96.64%
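The text does not state how the indexes in Table 1 are computed. As a hedged illustration, the sketch below uses the standard per-class definitions of precision, recall and F-score (F = 2PR / (P + R)); whether the "Accuracy" column corresponds to per-class precision is an assumption, and the labels are made-up toy data, not the patent's results.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f_score) for one class using the standard
    one-vs-rest definitions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy labels: n = normal, s = interictal, i = ictal.
y_true = ["n", "n", "i", "s", "s", "i"]
y_pred = ["n", "i", "i", "s", "s", "i"]
precision, recall, f_score = per_class_metrics(y_true, y_pred, "i")
```

Reporting the three values per class, as Table 1 does, shows whether errors are concentrated in one class rather than averaged away.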

The method is based on a semi-supervised machine learning algorithm and needs only a small number of labeled samples for model training, which removes a large amount of manual labeling work. The low-dimensional representation of the samples is learned with feature projection and domain embedding, which preserves the local structure of the data while fusing the pairing relations among samples into the model, so that the low-dimensional representation has distinguishability and discriminability for epileptic EEG signals. Unlike the common practice of completing the dimensionality reduction task and the classification task in separate stages, the low-dimensional representation of the samples and the identification of sample labels are integrated into a single learning task. Through iterative optimization, the dependence of semi-supervised learning on labeled samples is reduced step by step until the label information of all training samples has been identified.

In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.

Claims

1. The method for recognizing the epilepsia electroencephalogram signals based on the semi-supervised learning of the domain embedding probability is characterized by comprising the following steps:

Step 3: pairing the samples in the set X_l according to the category label matrix Y_l to form a homogeneous sample pair set M and a heterogeneous sample pair set D, where M = {(x_i, x_j) | y_i = y_j} and D = {(x_i, x_j) | y_i ≠ y_j};

z_i＝A^Tx_i， (2)

The projection matrix A is calculated as:

where I denotes the identity matrix, |M| and |D| denote the number of sample pairs in the sets M and D respectively, Tr{·} denotes the trace of a matrix, and T denotes the matrix transpose; setting

eigenvalue decomposition is performed on the resulting matrix, and the matrix A is updated with the eigenvectors corresponding to the e largest eigenvalues;

Step 10: collecting an EEG sample x_test to be tested and preprocessing it, calculating its low-dimensional feature representation z_test according to equation (2), calculating the distance dis(z_i, z_test) between z_test and the low-dimensional feature representation z_i of each sample in the data set X_l, selecting the r feature vectors with the smallest dis(z_i, z_test), taking by voting the category that occurs most often among these r feature vectors as the category of x_test, and finally determining from the category of x_test whether x_test is an epileptic EEG signal.

2. The domain embedding probability-based semi-supervised learning epileptic brain electrical signal identification method according to claim 1, wherein the objective function of the semi-supervised learning model of the domain embedding probability of the step 7 is expressed as:

Equation (6) can be expressed as:

3. The domain embedding probability-based semi-supervised learning epileptic electroencephalogram signal identification method according to claim 1, wherein the step of updating the category labels corresponding to the k feature vectors with the smallest one-dimensional entropy in step 7 comprises: calculating the one-dimensional entropy of each label vector in Y_u, the one-dimensional entropy being calculated by the following formula:

where y_i ∈ Y_u and y_i,k denotes the k-th component of the label vector y_i; formula (9) expresses the information content of the components of the label vector and the uncertainty of the category to which the sample belongs, the larger the entropy, the greater the uncertainty of the category, and the smaller the entropy, the smaller the uncertainty; extracting the k feature vectors with the smallest one-dimensional entropy;

updating the class labels of the corresponding samples according to their maximum component, i.e. if the j-th component of y_i is the largest, then set