CN111126241B - Electroencephalogram mode extraction method based on optimal sequence feature subset - Google Patents

Electroencephalogram mode extraction method based on optimal sequence feature subset

Info

Publication number
CN111126241B
CN111126241B (application CN201911319017.2A; publication CN 111126241 B)
Authority
CN
China
Prior art keywords
sequence
data
representing
electroencephalogram
equal
Prior art date
Legal status
Active
Application number
CN201911319017.2A
Other languages
Chinese (zh)
Other versions
CN111126241A (en)
Inventor
臧明文
黄刚
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN201911319017.2A
Publication of CN111126241A
Application granted
Publication of CN111126241B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/08 Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/211 Selection of the most significant subset of features
    • G06F 18/2113 Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/015 Input arrangements based on nervous system activity detection, e.g. brain waves [EEG] detection, electromyograms [EMG] detection, electrodermal response detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F 2203/01 Indexing scheme relating to G06F3/01
    • G06F 2203/011 Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2218/00 Aspects of pattern recognition specially adapted for signal processing
    • G06F 2218/12 Classification; Matching

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Dermatology (AREA)
  • General Health & Medical Sciences (AREA)
  • Neurology (AREA)
  • Neurosurgery (AREA)
  • Human Computer Interaction (AREA)
  • Measurement And Recording Of Electrical Phenomena And Electrical Characteristics Of The Living Body (AREA)

Abstract

The invention discloses an electroencephalogram mode extraction method based on an optimal sequence feature subset, aimed at the high-precision classification requirements that brain-computer interface systems place on electroencephalogram data. PAA dimensionality reduction and SAX symbolization reduce the complexity of the data, and similarity computation between data is replaced by a Hash scheme, giving higher speed and efficiency. A discrimination measure is defined to determine the discriminative power of the symbol subsequences for different electroencephalogram signals.

Description

Electroencephalogram mode extraction method based on optimal sequence feature subset
Technical Field
The invention relates to the design and implementation of methods for electroencephalogram data processing, feature discovery, pattern classification, and pattern mining based on sequence feature subsets. It aims to form an electroencephalogram analysis and mining system based on sequence feature subsets, and belongs to the intersection of electroencephalogram analysis and pattern mining technologies.
Background
The optimal feature subset is the set of subsequences that most strongly represents a class. Classification methods based on feature subsets offer high classification accuracy, high classification speed, and strong interpretability. A feature subset consists of a number of discriminative subsequences that express the greatest differences between different sequence data. Besides supporting efficient classification, a feature subset carries intuitively interpretable sequence features. The principle of the conventional feature-subset discovery algorithm is as follows: from the set of all possible candidate subsequences, compute and compare the classification information gain of each, and finally select the subsequence with the largest between-class difference as the optimal feature subsequence. This style of subsequence feature extraction is therefore a promising approach in the field of electroencephalogram analysis.
Electroencephalogram patterns have become a research hotspot as they are increasingly used in gaming applications and in stroke rehabilitation, where brain signals of an imagined task are converted into the intended movement of a paralyzed limb. For example, a wheelchair controlled through a brain-computer interface can enable a disabled person to move around a house and perform basic tasks. Furthermore, through a brain-computer interface we can detect in advance that a person is about to suffer a seizure and notify them early, preventing accidents or serious injury. Non-invasive sensors have been widely used for acquiring electroencephalographic signals because of their low cost, ease of use, and freedom from the surgery that invasive sensors require.
Some results have been obtained in using electroencephalogram signals to drive dependable neural plasticity or rehabilitation robots, but brain-computer interfaces for rehabilitation remain an emerging field. Classifying different tasks from electroencephalogram signals more accurately is beneficial not only for gaming and rehabilitation but also for better detecting diseases or abnormal behaviors, such as epilepsy, sleep apnea, sleep stages, and drowsiness. However, the data transmitted by the sensor faces problems such as noise interference, distortion of the true data, and high data dimensionality. A method that classifies different types of electroencephalogram signals with high precision is therefore urgently needed to improve the performance of brain-computer interface systems.
The optimal feature subsequence is used to find the most discriminative features in the electroencephalogram signal, and only a portion of the sequence data is used when discrimination is needed; compared with methods that train and classify on complete data samples, this gives stronger noise resistance and more accurate data classification and extraction. In addition, because the optimal subset is extracted explicitly, the method is highly interpretable: the differences are obtained along with the output features, providing a basis for analyzing the causes behind the electroencephalogram data.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the high-precision classification requirement of a brain-computer interface system on classification of electroencephalogram data, the invention provides an electroencephalogram mode extraction method based on an optimal sequence feature subset.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:
an electroencephalogram mode extraction method based on an optimal sequence feature subset comprises the following steps:
step 1, acquiring real-time electroencephalogram data through a sensor, sending a request to a central server, and requesting to send the electroencephalogram data.
Step 2, the central server receives the sensor request, receives the data, and processes it; the acquired data is preliminarily processed to obtain an original data sequence $C=(c_1,c_2,c_3,\dots,c_n)$, where n is the length of the original data, $C \in D$, and D is the original data set.
Step 3, apply piecewise aggregate approximation to the original data C to reduce the data dimensionality:

$$\bar{c}_t=\frac{w}{n}\sum_{j=\frac{n}{w}(t-1)+1}^{\frac{n}{w}t}c_j$$

where $1\le j\le n$ and $1\le t\le w$; $\bar{C}$ denotes the sequence after piecewise segmentation, $w$ the number of data segments in the piecewise aggregate approximation, $c_j$ an element of the original sequence, and $\bar{c}_t$ an element of the sequence after the segmented representation.
Step 4, symbolize the dimensionality-reduced data, converting it into the letter-sequence representation of the corresponding letter space:

$$\hat{c}_i=\alpha_p, \quad \beta_{p-1}\le \bar{c}_i<\beta_p$$

where $1\le i\le w$ and $\beta_{p-1}<\beta_p$; $\hat{C}$ denotes the symbolized segmented sequence, $\alpha_p$ the symbol assigned by the symbol mapping, and $\beta_{p-1}$ and $\beta_p$ the bounds of the interval mapped to the symbol $\alpha_p$.
Step 5, the server passes the symbolized sequence to the central online classifier for classification. If the result matches a normal electroencephalogram signal, no response is made and the server continues to wait for the next sensor request. If the result matches an abnormal electroencephalogram signal, feedback is sent to the target connected to the sensor.
The classification processing in step 5 includes the steps of:
the initialization of the classifier in the server requires a set of labeled data to be trained to obtain a sufficiently reliable and accurate feature subset, at step 51, the optimal feature sequence in the feature subset.
Set={s1,s2,…,sk,…,sl}
Wherein k is more than or equal to 1 and less than or equal to l, Set represents an optimal subsequence Set, and skRepresenting a certain sequence of features in the set, and l representing the current number of features.
Step 52, because data of the same type may be mapped to different symbolized sequences after the original data is mapped into the symbol space, similar sequences are obtained by masking some letters of the symbolized sequences, mitigating sequence ordering effects and outlier influence.
Round 1: $\hat{C}\,\&\,M_1$;
Round 2: $\hat{C}\,\&\,M_2$;
…
Round $m-1$: $\hat{C}\,\&\,M_{m-1}$;
in each round the mask sets the two adjacent positions $q-1$ and $q$ to 0, where $\hat{C}$ denotes a symbolized sequence of length m, m is the sequence length, $\&$ is the logical AND of corresponding elements, and $1\le q-1<q\le m$, with $q-1$ and $q$ two adjacent positions in the sequence.
Step 53, in each round the symbol sequences masked in step 52 are mapped into the Hash table T by a Hash operation, and the count of symbolized sequences sharing the same Hash value is incremented.
Step 54, the number of symbol sequences with Hash collisions is stored in the Hash table T, and the influence of each symbol subsequence on classification is computed by defining the discrimination Dist:

$$Dist=d_{far}+d_{close}$$

where $d_{far}$ is the discrimination value with respect to the other classes and $d_{close}$ the discrimination value within this class.
Step 55, extract the k candidate subsequences with the highest discrimination value among all symbol subsequences in each class, then compute the overall information entropy $I_S$ on the data set D and select the candidate with the maximum information gain

$$\Delta I = I - I_S$$

as the basis for the optimal feature subset of the class, where $I$ denotes the amount of information before classification and $I_S$ the amount of information after classification.
Step 56, for the original data acquired in each sensing round, if a better difference entropy can be obtained, the optimal symbol subsequence is added into Set and the data in Set is updated. Whether the overall subsequence is optimal is judged using the expectation rate. If the expectation rate lies within the preset threshold of the sequence expectation rate, the symbol sequence is matched against the sequences in the feature subset; whether it is a normal or abnormal sequence is judged from the matched feature sequence, and if it matches an abnormal sequence, the result is fed back to the sensor user.
Preferably: the preliminary processing in step 2 includes verification, de-duplication, and normalization operations.
Preferably: after the piecewise aggregation processing, any segment mean that satisfies $\beta_{p-1}\le \bar{c}<\beta_p$ is mapped to the letter $\alpha_p$.
Compared with the prior art, the invention has the following beneficial effects:
the invention does not train samples on the whole sequence of each sample of a data set like the traditional electroencephalogram analysis method when processing the electroencephalogram data, and establishes the mapping of the original data in a symbolic expression mode through dimension reduction processing. The method not only reduces the complexity of the data to a great extent, but also avoids the interference of noise on the whole data received by the algorithm, and has higher precision. On the other hand, the Hash matching mechanism is used in the model, so that the matching and calculation amount among symbols is reduced to a great extent, the access speed is improved, the requirement of back-end data on the sensor is further reduced, the requirement of electroencephalogram equipment on hardware is reduced, and the barrier on the hardware is broken.
In conclusion, the invention can overcome the problems of low precision and high complexity in the traditional electroencephalogram mode extraction, and because of using the optimal characteristic subsequence matching mode, most noise interference on data is avoided, the calculation complexity is reduced, and the system operation efficiency is improved.
Drawings
FIG. 1 is a flow chart of user data discrimination.
FIG. 2 illustrates a user data and background service interaction flow.
FIG. 3 is a diagram of the SAX symbolic representation.
FIG. 4 shows the optimal feature subsequence selection process.
Detailed Description
The present invention is further illustrated by the following description in conjunction with the accompanying drawings and the specific embodiments, it is to be understood that these examples are given solely for the purpose of illustration and are not intended as a definition of the limits of the invention, since various equivalent modifications will occur to those skilled in the art upon reading the present invention and fall within the limits of the appended claims.
An electroencephalogram mode extraction method based on an optimal sequence feature subset is shown in fig. 1-4, and comprises the following steps:
step 1, acquiring real-time electroencephalogram data through a sensor, sending a request to a central server, and requesting to send the electroencephalogram data.
Step 2, the central server receives the sensor request, receives the data, and processes it. The acquired data undergoes preliminary processing, including verification, de-duplication, and normalization, to obtain the original data C.
Step 3, apply piecewise aggregate approximation to the original data C, reducing the data dimensionality to improve processing speed and precision. Piecewise Aggregate Approximation (PAA) is a dimensionality-reduction method for high-dimensional sequence data: the data is divided into segments and aggregated approximately, with the mean of the data in each segment used as that segment's approximate value, which effectively reduces the data dimensionality and improves efficiency:

$$\bar{c}_t=\frac{w}{n}\sum_{j=\frac{n}{w}(t-1)+1}^{\frac{n}{w}t}c_j$$

where $1\le j\le n$ and $1\le t\le w$; $\bar{C}$ denotes the sequence after piecewise segmentation, $w$ the number of data segments in the piecewise aggregate approximation, $c_j$ an element of the original sequence, and $\bar{c}_t$ an element of the sequence after the segmented representation.
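As an illustrative sketch (not part of the patent text), the PAA step can be written as follows; the function name `paa` and the simplifying assumption that w divides n evenly are mine:

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation: split a length-n series into w
    equal segments and replace each segment by its mean (illustrative;
    assumes w divides n evenly)."""
    n = len(series)
    assert n % w == 0, "sketch assumes w divides n"
    seg = n // w
    return np.asarray(series, dtype=float).reshape(w, seg).mean(axis=1)

c = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(paa(c, 4))  # -> [1.5 3.5 5.5 7.5]
```

Each output value is the mean of one segment, matching the summation formula above with segment size n/w.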
Step 4, Symbolic Aggregate approXimation (SAX): on the basis of the piecewise aggregate approximation, the segmented data is represented symbolically, which reduces the influence of data fluctuations within a range and improves precision and classification efficiency. The dimensionality-reduced data is therefore converted into the letter-sequence representation of the corresponding letter space; that is, the segmented data is mapped onto the alphabet through the Gaussian distribution, with the mapping determined by:

$$\hat{c}_i=\alpha_p, \quad \beta_{p-1}\le \bar{c}_i<\beta_p$$

where $1\le i\le w$ and $\beta_{p-1}<\beta_p$; $\hat{C}$ denotes the symbolized segmented sequence, $\alpha_p$ the symbol assigned by the symbol mapping, and $\beta_{p-1}$ and $\beta_p$ the bounds of the interval mapped to the symbol $\alpha_p$.

The data are mapped to the letter space according to the Gaussian distribution, and the letter corresponding to each datum is obtained by looking up the distribution table, where $\beta$ is the threshold of the data interval corresponding to each letter; that is, values falling in the same interval after the piecewise aggregate approximation are mapped to the same letter.
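A minimal sketch of this SAX mapping, assuming a 4-letter alphabet and the standard Gaussian breakpoints (−0.6745, 0, 0.6745) that cut the standard normal distribution into four equiprobable regions; the names `BREAKPOINTS`, `ALPHABET`, and `sax_symbolize` are illustrative:

```python
# Standard-normal quartile breakpoints for an alphabet of size 4.
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax_symbolize(paa_values):
    """Map z-normalized PAA means to letters: a value falling in the
    interval (beta_{p-1}, beta_p] is assigned the letter alpha_p."""
    out = []
    for v in paa_values:
        p = sum(v > b for b in BREAKPOINTS)  # count breakpoints below v
        out.append(ALPHABET[p])
    return "".join(out)

print(sax_symbolize([-1.2, -0.3, 0.1, 1.5]))  # -> abcd
```

Larger alphabets work the same way with more breakpoints from the Gaussian distribution table.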
Step 5, the server passes the symbolized sequence to the central online classifier for classification. If the result matches a normal electroencephalogram signal, no response is made and the server continues to wait for the next sensor request. If the result matches an abnormal electroencephalogram signal, feedback is sent to the target connected to the sensor.
The classification processing in step 5 includes the steps of:
the initialization of the classifier in the server requires a set of labeled data to be trained to obtain a sufficiently reliable and accurate feature subset, at step 51, the optimal feature sequence in the feature subset.
Set={s1,s2,…,sk,…,sl}
Wherein k is more than or equal to 1 and less than or equal to l, Set represents an optimal subsequence Set, and skIn a representation setA certain sequence of features, l, represents the current number of features.
Step 52, since data of the same type may be mapped to different symbolized sequences after the original data is mapped into the symbol space, similar sequences are obtained by masking some letters of the symbolized sequences, mitigating sequence ordering effects and outlier influence.
Round 1: $\hat{C}\,\&\,M_1$;
Round 2: $\hat{C}\,\&\,M_2$;
…
Round $m-1$: $\hat{C}\,\&\,M_{m-1}$;
in each round the mask sets the two adjacent positions $q-1$ and $q$ to 0, where $\hat{C}$ denotes a symbolized sequence of length m, m is the sequence length, $\&$ is the logical AND of corresponding elements, and $1\le q-1<q\le m$, with $q-1$ and $q$ two adjacent positions in the sequence.
Step 53, to ensure efficiency and speed up the similarity measurement between symbols, direct pairwise similarity computation over all N × N sequence pairs cannot be adopted; a Hash operation is therefore used instead of direct measurement. In each round, the symbol sequences masked in step 52 are mapped into the Hash table T by a Hash operation, and the count of symbol sequences sharing the same Hash value is incremented.
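The masking and hashing of steps 52–53 can be sketched as below. This is an illustrative reconstruction, not the patent's implementation: a `"_"` placeholder stands in for the logical-AND zeroing of two adjacent positions, and Python's built-in hashing of tuples plays the role of the Hash table T:

```python
from collections import Counter

def masked_variants(symbol_seq):
    """Yield every masked variant of a symbol sequence: round q (for
    q = 1 .. m-1) blanks the two adjacent positions q-1 and q."""
    m = len(symbol_seq)
    for q in range(1, m):
        mask = [1] * m
        mask[q - 1] = mask[q] = 0
        yield tuple(s if keep else "_" for s, keep in zip(symbol_seq, mask))

def hash_table(sequences):
    """Map every masked variant of every sequence into a hash table T,
    incrementing the counter when variants collide (identical masked
    variants hash to the same bucket)."""
    T = Counter()
    for seq in sequences:
        for variant in masked_variants(seq):
            T[variant] += 1
    return T

T = hash_table(["bccb", "bcab"])
# "bccb" and "bcab" agree once positions 3 and 4 (q-1, q) are masked:
print(T[("b", "c", "_", "_")])  # -> 2
```

A collision count of 2 here means two sequences look identical after the same pair of positions is masked, which is exactly the similarity signal step 54 builds on.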
Step 54, the number of symbol sequences with Hash collisions is stored in the Hash table T, and the influence of each symbol subsequence on classification is computed by defining the discrimination Dist:

$$Dist=d_{far}+d_{close}$$

where $d_{far}$ is the discrimination value with respect to the other classes and $d_{close}$ the discrimination value within this class; the discrimination of this class against the others is obtained by summing the two. Thus, the higher a symbolic subsequence's discrimination value for class A and the higher its discrimination value against non-A classes, the higher its overall discrimination, making it well suited as a criterion for distinguishing between classes.
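The patent does not spell out how $d_{far}$ and $d_{close}$ are derived from the collision counts, so the following is only a hedged sketch of one plausible reading: $d_{close}$ as the pattern's collision count inside its own class, and $d_{far}$ as a penalty for occurrences in the other classes. All names here are mine:

```python
from collections import Counter

def dist(pattern, tables_by_class, cls):
    """Illustrative Dist = d_close + d_far for one candidate pattern:
    d_close = collision count inside class `cls`,
    d_far   = negative count of the pattern in all other classes
    (an assumed reading of the patent's unspecified definitions)."""
    d_close = tables_by_class[cls][pattern]
    others = sum(t[pattern] for c, t in tables_by_class.items() if c != cls)
    d_far = -others
    return d_close + d_far

tables = {"normal": Counter({"bc__": 5}), "abnormal": Counter({"bc__": 1})}
print(dist("bc__", tables, "normal"))  # -> 4
```

Under this reading, a pattern frequent in its own class and rare elsewhere scores highest, which matches the criterion described in step 54.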
Step 55, extract the k candidate subsequences with the highest discrimination value among all symbol subsequences in each class, then compute the overall information entropy $I_S$ on the data set D and select the candidate with the maximum information gain

$$\Delta I = I - I_S$$

as the basis for the optimal feature subset of the class, where $I$ denotes the amount of information before classification and $I_S$ the amount of information after classification.
Step 56, for the original data acquired in each sensing round, if a better difference entropy can be obtained, the optimal symbol subsequence is added into Set and the data in Set is updated. For the overall accuracy of the model, whether the overall subsequence is optimal is judged using the expectation rate. If the expectation rate lies within the preset threshold of the sequence expectation rate, the symbol sequence is matched against the sequences in the feature subset; whether it is a normal or abnormal sequence is judged from the matched feature sequence, and if it matches an abnormal sequence, the result is fed back to the sensor user.
The operation method of the invention is as follows:
step A: the electroencephalogram signal is acquired by the wearable device by a device user, the electroencephalogram sensor on the wearable device has the characteristics of low power consumption and low design requirement, a simple detection and judgment process is needed, and data can be submitted to an application end, as shown in figure 2.
Step B: because the sensor cannot communicate with the server directly, the device user's communication equipment (such as a computer or mobile phone) sends the data collected by the sensor in a request to the server end, as shown in (c), then obtains the interaction information returned by the processing server and presents it to the user to decide the next action.
Step C: the processor processes the data transmitted by the communication equipment and then returns the result to the equipment through process (iv), where it is handled by the device-side software.
The optimal-sequence selection process during initialization is shown in the flow chart of FIG. 4 and comprises the following specific steps:
step C1: and cleaning all data of the original data set, and performing operations such as duplicate removal, normalization, abnormal value processing and the like.
Step C2: cleaning data C ═ C1,…,cn) Represented by PAA segmentation into
Figure BDA0002326627880000071
Step C3: as shown in fig. 3, the segmented sequence is further represented as a symbol sequence by the SAX method
Figure BDA0002326627880000072
The value after PAA segmentation treatment satisfies
Figure BDA0002326627880000073
Will map to the letter b and a segment sequence can be represented as a character sequence of bccbaaabc.
Step C4: the character sequence is randomly masked, and the left side of the table I and the table II shows that the gray area is a masking part, and a part of the character is randomly masked each time. And further carrying out Hash operation on unmasked parts in the masked sub-character sequence, adding one to the numerical value of the sequence corresponding to the operation result, and carrying out an adding operation on the numerical value when a plurality of characters generate the same Hash value collision. The right side of the table I and the table II shows the result table of hash operation of different sequences, and then different sequences are added and subtracted to obtain the discrimination.
Table 1: randomly masking the first two characters
Table 2: randomly masking the selected middle two characters
Step C5: if the length of the character sequence is l, the characters at two positions are taken to be randomly covered and Hash operation is required to be carried out at most
Figure BDA0002326627880000083
After this operation, the mapping operation of the Hash table in step 4 is completed.
Table 3: discrimination computation process
Step C6: using Dist ═ dfar+dcloseThe calculated discriminations were calculated as shown in table 3.
Step C7: and extracting the 10 sub-symbol sequences with the highest discrimination.
Step C8: information entropy of 10 sub-symbol sequences of a calculator on an original data set
Figure BDA0002326627880000091
If it satisfies
Figure BDA0002326627880000092
Then the sub-symbol sequence is decoded
Figure BDA0002326627880000093
And adding the mixture into the Set.
After the initialization of the server is completed, a specific processing flow for receiving the request is shown in fig. 1.
Step D1: and cleaning all data of the original data set, and performing operations such as duplicate removal, normalization, abnormal value processing and the like.
Step D2: cleaning data C ═ C1,…,cn) Expressed by PAA segmentation into
Figure BDA0002326627880000094
Step D3: then the segmented sequence is expressed into a symbol sequence by an SAX method
Figure BDA0002326627880000095
The value after PAA segmentation treatment satisfies
Figure BDA0002326627880000096
Is mapped to the letter b and a segment sequence can be represented as a sequence of characters.
Step D4: the symbol sequence is matched with the symbol sequence of a tree formed by the character good sequences in the feature subset from the tree root to the leaf node through a classifier.
Step D5: and if the normal sequence is matched, returning a normal value to the user software, and if the abnormal sequence is matched, returning an abnormal value warning.
Aiming at the high-precision classification requirements of brain-computer interface systems for electroencephalogram data, the brain-computer interface classifies signals by finding the optimal feature subsequences of the electroencephalogram signal, providing strongly interpretable results while ensuring classification precision. PAA dimensionality reduction and SAX symbolization reduce the complexity of the data, and similarity computation between data is replaced by a Hash scheme, giving higher speed and efficiency. A discrimination measure is defined to determine the discriminative power of the symbol subsequences for different electroencephalogram signals.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (3)

1. An electroencephalogram mode extraction method based on an optimal sequence feature subset is characterized by comprising the following steps:
step 1, acquiring real-time electroencephalogram data through a sensor, sending a request to a central server, and requesting to send the electroencephalogram data;
step 2, the central server receives the sensor request, receives the data, and processes it; the acquired data is preliminarily processed to obtain an original data sequence $C=(c_1,c_2,c_3,\dots,c_n)$, where n is the length of the original data, $C \in D$, and D is the original data set;
step 3, performing piecewise aggregate representation on the original data C and reducing the data dimensionality:

$$\bar{c}_t=\frac{w}{n}\sum_{j=\frac{n}{w}(t-1)+1}^{\frac{n}{w}t}c_j$$

wherein $1\le j\le n$ and $1\le t\le w$, $\bar{C}$ represents the sequence after piecewise segmentation, $w$ represents the number of data segments in the piecewise aggregate approximation, $c_j$ represents an element of the original sequence, and $\bar{c}_t$ represents an element of the sequence after the segmented representation;
step 4, performing symbolic conversion on the dimensionality-reduced data, converting it into the letter-sequence representation of the corresponding letter space:

$$\hat{c}_i=\alpha_p, \quad \beta_{p-1}\le \bar{c}_i<\beta_p$$

wherein $1\le i\le w$ and $\beta_{p-1}<\beta_p$, $\hat{C}$ represents the symbolized segmented sequence, $\alpha_p$ represents the symbol assigned by the symbol mapping, and $\beta_{p-1}$ and $\beta_p$ represent the bounds of the interval mapped to the symbol $\alpha_p$;
step 5, the server passes the symbolized sequence to the central online classifier for classification; if the result matches a normal electroencephalogram signal, no response is made and the server continues to wait for sensor requests; if the result matches an abnormal electroencephalogram signal, feedback is sent to the target connected to the sensor;
the classification processing in step 5 includes the steps of:
step 51, a set of labeled data is required to train the classifier initialized in the server, so as to obtain a feature subset with sufficient confidence and accuracy, where the optimal feature sequences in the feature subset are:
Set = {s_1, s_2, …, s_k, …, s_l}
wherein 1 ≤ k ≤ l, Set represents the optimal subsequence set, s_k represents a feature sequence in the set, and l represents the current number of features;
step 52, because data of the same type may be mapped to different symbol sequences after the original data is mapped into the symbol space, similar sequences are obtained by covering some letters of the symbol sequences, so as to mitigate the effects of sequence ordering and outliers;
round 1:
$$\hat{C}_1 = \hat{C} \;\&\; M_1$$
round 2:
$$\hat{C}_2 = \hat{C} \;\&\; M_2$$
round p:
$$\hat{C}_p = \hat{C} \;\&\; M_p, \quad \text{with positions } q-1 \text{ and } q \text{ set to } 0;$$
wherein $\hat{C}$ represents a symbolized sequence of length m, m represents the sequence length, & represents the logical AND of corresponding elements, 1 ≤ q ≤ m, and q−1 and q represent two adjacent positions in the sequence;
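The round-by-round covering of step 52 can be sketched by replacing two adjacent positions with a wildcard that stands in for the zeroed mask positions. This is an assumption about the mask layout, which appears only as equation images in the original filing:

```python
def masked_variants(seq, wildcard="*"):
    # One variant per round: in round q, the adjacent positions
    # q-1 and q are covered (the logical-AND-with-mask of step 52).
    m = len(seq)
    variants = []
    for q in range(1, m):
        s = list(seq)
        s[q - 1] = s[q] = wildcard
        variants.append("".join(s))
    return variants
```
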
step 53, in each round, the symbol sequences covered in step 52 are mapped into the hash table T by a hash operation, and the count for symbol sequences with the same hash value is incremented;
step 54, the numbers of symbol sequences with hash collisions are stored in the hash table T, and the influence of each symbol on classification is computed by defining the discrimination value Dist;
Dist = d_far + d_close
wherein d_far refers to the discrimination value with respect to classes outside this class, and d_close refers to the discrimination value within this class;
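The hash-table bookkeeping of steps 53-54 amounts to counting equal covered sequences: equal keys collide and a counter is auto-incremented. Python's `Counter` (itself backed by a hash table) reproduces that bookkeeping; `count_patterns` is our own name for the sketch:

```python
from collections import Counter

def count_patterns(masked_seqs):
    # Equal covered sequences hash to the same bucket; the stored
    # count is incremented on each collision, as in steps 53-54.
    table = Counter()
    for s in masked_seqs:
        table[s] += 1
    return table
```
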
step 55, extracting the k candidate subsequences with the highest discrimination values from all symbol subsequences of each class, then computing the overall information entropy I_S over the data set D, and selecting the minimum information difference Min(I′_S) as the basis of the optimal feature subset for the class, wherein I′_S = I − I_S, I′_S represents the difference entropy, I represents the amount of information before classification, and I_S represents the amount of information after classification;
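The difference entropy I′_S = I − I_S of step 55 can be sketched as an information-gain computation: entropy of the labels before classification minus the weighted entropy after splitting on a candidate subsequence. The function names and the `partition` argument (which group each sample falls into) are illustrative assumptions, not from the patent:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (bits) of a label multiset.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partition):
    # I' = I - I_S: entropy before minus weighted entropy after the split.
    n = len(labels)
    groups = {}
    for lab, g in zip(labels, partition):
        groups.setdefault(g, []).append(lab)
    i_s = sum(len(v) / n * entropy(v) for v in groups.values())
    return entropy(labels) - i_s
```
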
step 56, for the original data acquired in each sensing cycle, if a better difference entropy can be obtained, the optimal symbol subsequence is added to Set and the data in Set is updated; whether the overall subsequence is optimal is judged by the expectation rate;
(expectation-rate formula, presented as an equation image in the original filing)
if the expectation rate is within a preset threshold of the sequence expectation rate, the symbol sequence is matched against the sequences in the feature subset; whether it is a normal sequence is judged from the matched feature sequence, and if it matches an abnormal sequence, the result is fed back to the sensor user.
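The matching of an incoming symbol sequence against covered feature sequences can be sketched as a wildcard-tolerant comparison: every uncovered letter must agree. The patent does not spell out the matcher; this is an assumption consistent with the covering of step 52:

```python
def matches(symbol_seq, feature_seq, wildcard="*"):
    # A covered feature sequence matches when every non-wildcard
    # position agrees with the incoming symbol sequence.
    return len(symbol_seq) == len(feature_seq) and all(
        f == wildcard or f == s for s, f in zip(symbol_seq, feature_seq))
```
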
2. The electroencephalogram mode extraction method based on the optimal sequence feature subset according to claim 1, characterized in that: the preliminary processing in step 2 includes performing a verification and de-normalization operation.
3. The electroencephalogram mode extraction method based on the optimal sequence feature subset according to claim 2, characterized in that: all data whose mean values after the piecewise aggregation processing satisfy
$$\beta_{p-1} \le \bar{c}_i < \beta_p$$
will be mapped to the letter α_p.
CN201911319017.2A 2019-12-19 2019-12-19 Electroencephalogram mode extraction method based on optimal sequence feature subset Active CN111126241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911319017.2A CN111126241B (en) 2019-12-19 2019-12-19 Electroencephalogram mode extraction method based on optimal sequence feature subset


Publications (2)

Publication Number Publication Date
CN111126241A CN111126241A (en) 2020-05-08
CN111126241B true CN111126241B (en) 2022-04-22

Family

ID=70500226


Country Status (1)

Country Link
CN (1) CN111126241B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107518894A (en) * 2017-10-12 2017-12-29 Nanchang Police Dog Base, Ministry of Public Security Construction method and device for an animal electroencephalogram classification model
CN109199414A (en) * 2018-10-30 2019-01-15 Wuhan University of Technology Audio-visually induced emotion recognition method and system based on EEG signals
CN109497996A (en) * 2018-11-07 2019-03-22 Taiyuan University of Technology Complex network construction and analysis method for EEG microstate temporal features
CN109691996A (en) * 2019-01-02 2019-04-30 Central South University EEG signal feature selection and classifier selection method based on hybrid binary coding

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080263432A1 (en) * 2007-04-20 2008-10-23 Entriq Inc. Context dependent page rendering apparatus, systems, and methods


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Turky Alotaiby et al.; "A review of channel selection algorithms for EEG signal processing"; EURASIP Journal on Advances in Signal Processing (2015); 2015-08-01; pp. 1-21 *


Similar Documents

Publication Publication Date Title
CN110008674B (en) High-generalization electrocardiosignal identity authentication method
Ahmed et al. Appearance-based arabic sign language recognition using hidden markov models
CN106529504B A bimodal video emotion recognition method based on composite spatio-temporal features
CN109190698B (en) Classification and identification system and method for network digital virtual assets
CN113486752B (en) Emotion recognition method and system based on electrocardiosignal
JP7330338B2 (en) Human image archiving method, device and storage medium based on artificial intelligence
CN115294658A (en) Personalized gesture recognition system and gesture recognition method for multiple application scenes
Chowdhury et al. Lip as biometric and beyond: a survey
Wen et al. A two-dimensional matrix image based feature extraction method for classification of sEMG: A comparative analysis based on SVM, KNN and RBF-NN
CN114384999B (en) User-independent myoelectric gesture recognition system based on self-adaptive learning
Jang et al. Motor-imagery EEG signal classification using position matching and vector quantisation
CN114239649B (en) Identity recognition method for discovering and recognizing new user by photoelectric volume pulse wave signal of wearable device
Karayaneva et al. Unsupervised Doppler radar based activity recognition for e-healthcare
CN111126241B (en) Electroencephalogram mode extraction method based on optimal sequence feature subset
CN112380903B (en) Human body activity recognition method based on WiFi-CSI signal enhancement
CN106709442B (en) Face recognition method
CN110502883B (en) PCA-based keystroke behavior anomaly detection method
Suriani et al. Smartphone sensor accelerometer data for human activity recognition using spiking neural network
Popescu-Bodorin Exploring new directions in iris recognition
Cheng et al. Advancing surface feature encoding and matching for more accurate 3D biometric recognition
CN113057654B (en) Memory load detection and extraction system and method based on frequency coupling neural network model
CN114764580A (en) Real-time human body gesture recognition method based on no-wearing equipment
KR101556696B1 (en) Method and system for recognizing action of human based on unit operation
Saqib et al. Recognition of static gestures using correlation and cross-correlation
KR101058719B1 (en) Method of input control based on hand posture recognization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant