Designing Classifiers for Brain Computer Interface

  ECE G313

Ikaro Silva

 

Introduction

 

            Electroencephalographic (EEG) signals are a rich source of information about motor imagery and motor activity. This information has been used successfully by researchers to create brain-computer interfaces (BCIs) (Guger et al, 2001). A BCI allows a user to control an external device (e.g., a computer mouse) simply by imagining movements, mainly through pattern recognition analysis of the user's EEG signal. The applications of such EEG classifiers go beyond BCIs; they have also been used to restore some movement control to paraplegic patients. In that setting the EEG classifier discriminates which limb the patient is trying to move and sends its output to a controller that electrically stimulates the limb's muscles (a hybrid brain-machine interface, HBMI) (Wolpaw et al, 2000). The goal of this project is to compare the performance (classification error) of three classifiers on their ability to determine which index finger (left or right) a subject moved, based on the subject's spectral EEG data. The performance of these classifiers will also be compared with that of third-party classifiers used on the same dataset. The three classifiers developed in this project are a simple linear classifier, a naïve Bayes classifier, and a hidden Markov model (HMM) classifier.

 

Background

 

EEG signals are typically non-stationary and stochastic, and have a low signal-to-noise ratio (SNR) (Baillet et al, 2001). In addition, the noise sources in the signal are highly non-Gaussian in nature. EEG pattern classification problems are therefore usually handled by collecting a large number of event-triggered trials and averaging them (Guger et al, 2001; Baillet et al, 2001; Xu et al, 1999). EEG signals have been studied and analyzed in the frequency domain over four major spectral regions: δ (delta, 0.5-3.5 Hz), θ (theta, 4.0-7.5 Hz), α (alpha, 8.0-11.5 Hz), and β (beta, 12.0 Hz to the Nyquist frequency) (Maurer & Dierks, 1991). It is well known that motor activity produces event-related synchronization and desynchronization in some of these frequency bands (Maurer & Dierks 1991; Dolfus 1986; Guger et al 2001; Schloegl 1997). The following features of the EEG spectrum have been reported specifically in relation to motor activity:

 

·        Decreased θ, α, and β1 activity. In particular, a significant decrease in θ activity on the left central lead for right-hand movements and on both the central (p<0.01) and parietal (p<0.05) leads for left-hand movements. In addition, β1 showed high reactivity (p<0.05) on the parieto-contralateral leads (Dolfus, 1986).

·        Event-related desynchronization (ERD) was observed in the α and β2 bands over the contralateral hemisphere (electrodes C3 or C4) for subjects imagining hand movements. In some subjects, event-related synchronization (ERS) was also observed over the ipsilateral area (Schloegl, 1997).

·        Activation of hand-area neurons, either by preparation for a real movement or by imagination of a movement, is accompanied by a circumscribed ERD over the hand area (the desynchronization occurs in the contralateral area) (Guger et al, 2001).

·        Movement related potentials (MRPs) and α ERDs modeled the responses of primary sensorimotor (M1-S1), supplementary motor (SMA), and posterior parietal (PP) areas during the preparation and execution of unilateral finger movements. Maximum responses were modeled in the contralateral M1-S1 during both preparation and execution of the movement (Babiloni et al, 1999).

 

Dataset

 

The EEG data set (“DataSet IV”) used in this project is available at http://ida.first.gmd.de/~blanker/competition/ (Blankertz et al, 2002). The data consists of 416 epochs, each 500 ms long and ending 130 ms before a key press. Of these, 316 epochs are labeled (0 for upcoming left-hand movements and 1 for upcoming right-hand movements). The performance of the classifiers will be measured on the remaining 100 epochs of the data (which are never used during the learning phase). The data was sampled at 100 Hz from 28 EEG channels positioned according to the international 10/20 system. The dataset was part of a BCI pattern recognition competition launched in 2003, so the performance of the classifiers developed here can be compared to that of the other classifiers entered in the competition. The average signal power for the left- and right-hand movements is shown in Figure 1 as an interpolated plot. From the average of the signal values alone, it is not possible to distinguish left- from right-hand movements.

 

Figure 1 - Average channel values over all 316 labeled epochs (top of the image is the front of the head). Left-hand movement (left) and right-hand movement (right) are indistinguishable from the time average.

 

Feature Extraction

 

          The data set was manipulated in order to take explicit advantage of the time-frequency information in the EEG signal. A short-time FFT (STFT) was computed for each trial with a 25-sample Hanning window and an overlap of 12 samples, and the magnitude of the STFT was used. The 500-ms signal of a trial was thus broken into 3 sequential segments, with the STFT of each segment giving information in 4-Hz-wide bands from 0 to 50 Hz.
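A minimal MATLAB sketch of this windowing scheme is given below. The variable names, the 50-sample epoch length, and the exact placement of the last window are my assumptions rather than the original code; the sketch only illustrates how the 3 x 13 x 28 = 1092 magnitude features arise.

    % Sketch: magnitude STFT features for one trial.
    % x: 50 x 28 matrix (samples x channels), 500 ms at 100 Hz.
    win    = 0.5*(1 - cos(2*pi*(0:24)'/24));   % 25-point Hanning window
    starts = [1 14 26];                        % ~12-sample overlap; last window anchored at the epoch end (assumed)
    nSeg   = numel(starts);
    feats  = zeros(nSeg, 13, size(x,2));       % time step x frequency band x channel
    for c = 1:size(x,2)
        for s = 1:nSeg
            idx = starts(s) + (0:24);          % samples covered by this segment
            X   = fft(win .* x(idx, c));       % windowed FFT of the segment
            feats(s, :, c) = abs(X(1:13)).';   % 13 magnitude bins, 0-48 Hz in 4 Hz steps
        end
    end
    featRow = feats(:)';                       % 1 x 1092 feature row for this trial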

            The data was further formatted into a single matrix in which each column represented a feature and each row a trial. The matrix had 1092 columns (features): time steps (3) x frequency bands (13) x channels (28) = 1092. It had 315 rows, corresponding to the 315 training trials from the original dataset. The formatted data was then transformed by linear discriminant analysis (LDA) in order to reduce the dimensionality of the feature space.

            The LDA transformation on the data set was done in the following manner:

  1. The features for each channel (time steps (3) x frequency bands (13) = 39 features) were reduced to 3 dimensions with an LDA applied to that channel alone.
  2. The reduced features from step 1 were then concatenated across all channels, giving a 315 x 84 matrix (84 features = 3 features x 28 channels).
  3. The final transformed data was obtained by applying one more LDA to the matrix from step 2.

An LDA with 3 dimensions per channel was used so that the class separation could be inspected (on 3D plots) for each of the 28 channels. The per-channel projections were then combined, and the final matrix was also transformed into a 3-dimensional feature space for visualization. Figure 2 shows a scatter plot of the final transformed matrix (the third dimension is omitted because the data was not separable in that dimension). The transformation matrices used in the steps above were saved and applied to the test data.
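Each of the LDA reductions above can be sketched as a standard two-class Fisher discriminant built from the within- and between-class scatter matrices. The function below is a minimal sketch under that assumption (the variable names are mine, and any regularization of the scatter matrices is omitted); the returned transformation matrix W is what would be saved and applied to the test data.

    % Sketch: project two-class data X (trials x features) onto nDim LDA directions.
    % y holds the class labels (0 = left, 1 = right).
    function [Z, W] = lda_project(X, y, nDim)
        mu0 = mean(X(y==0,:), 1);   mu1 = mean(X(y==1,:), 1);
        X0  = X(y==0,:) - repmat(mu0, sum(y==0), 1);   % center each class
        X1  = X(y==1,:) - repmat(mu1, sum(y==1), 1);
        Sw  = X0'*X0 + X1'*X1;                  % within-class scatter
        Sb  = (mu1 - mu0)'*(mu1 - mu0);         % between-class scatter
        [V, D]      = eig(Sb, Sw);              % generalized eigenproblem Sb*v = lambda*Sw*v
        [vals, ord] = sort(real(diag(D)), 'descend');
        W = V(:, ord(1:nDim));                  % keep the nDim most discriminative directions
        Z = X*W;                                % projected data
    end

Note that for a two-class problem Sb has rank one, so only the leading direction carries class separation in the strict Fisher sense; the extra dimensions retained per channel mainly serve the 3D visualization described above.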

 

Figure 2 - Scatter plot of the final transformed data. Note that there is some overlap, but most of the data can be separated linearly.

Classifiers

 

Simple Linear Classifier

 

            A simple linear classifier was generated according to the following rule: the mean of each class in the first LDA dimension was estimated, and a threshold was placed at the midpoint of the two means (0.0611). The classifier then assigned a class to a sample based on this threshold. The training error for this classifier was 3%; however, on the test data set (100 samples) the error was 30%.
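A sketch of this rule is given below, assuming z1 and z1test hold the first LDA feature of the training and test trials and y, ytest the labels (these names and the orientation of the threshold test are assumptions):

    % Sketch: midpoint-threshold classifier on the first LDA feature.
    m0  = mean(z1(y==0));             % class means in the first dimension
    m1  = mean(z1(y==1));
    thr = (m0 + m1)/2;                % midpoint threshold (0.0611 in this project)
    if m1 > m0
        yhat = double(z1test > thr);  % class 1 lies above the threshold
    else
        yhat = double(z1test < thr);  % class 1 lies below the threshold
    end
    testErr = mean(yhat ~= ytest);    % fraction of misclassified test trials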

 

Bayes Classifier

 

            The Bayes classifier assigned a class to a sample based on the highest likelihood of the first feature. The likelihood for each class was estimated by building a histogram of the first feature over all of that class's training data and smoothing the histogram with a 10-tap low-pass averaging filter.
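The likelihood estimation might look like the following sketch; the number of histogram bins and the variable names are assumptions, and the same 10-tap averaging filter is applied to both classes so any filter delay affects them equally.

    % Sketch: smoothed-histogram likelihoods on the first LDA feature.
    centers = linspace(min(z1), max(z1), 50);  % common bin centers (bin count assumed)
    h0 = hist(z1(y==0), centers);              % raw class histograms
    h1 = hist(z1(y==1), centers);
    b  = ones(1,10)/10;                        % 10-tap low-pass averaging filter
    lik0 = filter(b, 1, h0);  lik0 = lik0/sum(lik0);   % smoothed, normalized likelihood, class 0
    lik1 = filter(b, 1, h1);  lik1 = lik1/sum(lik1);   % smoothed, normalized likelihood, class 1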

 

Figure 3 - Estimated likelihood distributions over the first feature for both classes. The likelihoods were estimated by smoothing the class histograms with a low-pass filter over a 10-sample window.

 

            The decision rule for a sample was based on the nearest-neighbor likelihood value. For example, if the first feature of a test sample had a value of 1.14, the first-feature value closest to 1.14 was found for each class, the corresponding likelihood values were compared, and the sample was assigned to the class with the higher likelihood. The test error for this classifier was 31%.
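One reading of this rule is to evaluate both smoothed likelihoods at the histogram bin nearest the test value and pick the larger; the sketch below implements that simplified reading, continuing from the variables above.

    % Sketch: classify one test value z1new by the nearest-bin likelihoods.
    [d, k] = min(abs(centers - z1new));   % histogram bin closest to the test value
    if lik1(k) > lik0(k)
        yhatNew = 1;                      % right-hand movement
    else
        yhatNew = 0;                      % left-hand movement
    end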

Hidden Markov Classifier

 

            The final classifier was a hidden Markov model (HMM) classifier. The goal of this design was to capture some dependency across channels as well as along time.

The feature extraction was done in a manner similar to that of the other classifiers: the signals were broken into 3 time steps and their spectra estimated. This time, however, the feature reduction was performed by selecting the features that differed significantly between left- and right-hand movements (p < 0.01), by means of an analysis of variance (ANOVA) on the phase of the FFT. The reduction using ANOVA left six significant frequencies and 17 significant channels (compared to the original 13 frequencies and 28 channels).
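A sketch of this selection step, assuming phi is a trials x features matrix of FFT phase values and y the class labels (anova1 is from the Statistics Toolbox; the variable names are mine):

    % Sketch: keep features whose phase differs between classes at p < 0.01.
    nFeat = size(phi, 2);
    p = ones(1, nFeat);
    for f = 1:nFeat
        p(f) = anova1(phi(:,f), y, 'off');  % one-way ANOVA per feature, no figure
    end
    keep   = find(p < 0.01);                % indices of significant features
    phiSel = phi(:, keep);                  % reduced feature matrix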

The data was modeled with a first-order, non-ergodic hidden Markov model (Rabiner, 1989): the state sequence always started in the first state and progressed sequentially to the last state. Each state represented the phase information observed at one channel, so there were 17 states. Since there were six significant frequencies and three time steps, 18 different HMMs were created for each hand movement. The observation vector was the phase of each frequency rounded to the nearest integer between -3 and 3, so that a discrete observation alphabet could be used. The parameters of the models were estimated using the HMM Toolbox developed by Kevin Murphy (1998). The transition matrix was held fixed and set so that each state always moved to the next one (i.e., if the current state was state five, the next state would be state six).
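The sketch below illustrates the model structure described above: the fixed left-to-right transition matrix over the 17 channel-states, the deterministic start in state one, and the mapping of phase values to a discrete observation alphabet. The symbol offset and the treatment of the last state are my assumptions, and the toolbox training call is only named in a comment.

    % Sketch: non-ergodic (left-to-right) HMM structure over 17 channel-states.
    Q     = 17;                                   % one state per significant channel
    prior = [1; zeros(Q-1,1)];                    % the sequence always starts in state 1
    A     = [zeros(Q-1,1) eye(Q-1); zeros(1,Q)];  % state q always moves to state q+1
    A(Q,Q) = 1;                                   % last state absorbs (assumed); A is held fixed
    obs   = round(phase) + 4;                     % phase in [-pi,pi] -> integers -3..3 -> symbols 1..7
    % The emission probabilities would then be estimated from obs (e.g. with
    % dhmm_em in Murphy's HMM Toolbox) while keeping prior and A fixed; the
    % exact toolbox call is an assumption, not taken from the report.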

Once the parameters of the models were estimated, the accuracy of each of the 18 classifiers was measured on a validation set of 35 trials. A weight vector was then generated according to the following equation:

W = accuracy ./ sum(accuracy)

This weight vector was used to weight the output of each classifier according to its accuracy before summing the outputs and deciding on a class. The final result of the HMM classifier on the test dataset was an error of 51%, which is no better than guessing by chance.
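A sketch of this weighted combination for one test trial, assuming accuracy is a 1 x 18 vector of validation accuracies and vote(k) is +1 when classifier k favors the right-hand model and -1 otherwise (the vote encoding is an assumption):

    % Sketch: accuracy-weighted voting over the 18 HMM classifiers.
    W = accuracy ./ sum(accuracy);   % weights proportional to validation accuracy
    combined = sum(W .* vote);       % weighted sum of the individual votes
    if combined > 0
        yhat = 1;                    % right-hand movement
    else
        yhat = 0;                    % left-hand movement
    end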

 

 

 

Summary

 

Overall, the best classifier was the linear classifier, with a test error of 30%. Both the linear and the Bayes classifiers appear to have overfit, as suggested by the small training error and much larger test error. This reflects the curse of dimensionality in this problem: the data set is small and the feature space is high-dimensional for these classifiers. The HMM classifier had the poorest performance of the three. It is quite possible that there was an error in my code or in the interfacing with the third-party HMM Toolbox. One of the toughest issues with the HMM was how to design the model (i.e., what the observation vector should be, how many states to use, what the states should represent, and how to connect the different HMMs). While it is not obvious how to choose such parameters, it is obvious, in light of the error I obtained, that my choices were not the best way to proceed. In using ANOVA for feature reduction, it is also quite possible that the data was not Gaussian (an assumption made by ANOVA). As an additional note, the use of phase information may share some of the blame for the poor performance of the classifier; more accuracy might have been obtained with the magnitude information, but the software I had available did not support continuous observation vectors.

Overall, the most challenging aspect of this project was the feature extraction (or dimensionality reduction). This played a crucial role in the HMM model, where the design is very sensitive to feature characteristics (e.g., continuous versus discrete observations). The project was a great opportunity to gain a working knowledge of feature extraction and of the Bayes and HMM classifiers. It would be interesting to see how methods such as EM and SVM would perform on this problem (keeping in mind that performance is also strongly tied to how well the feature extraction is handled). A variety of approaches can be taken to the feature extraction and classification problems, and the best one would have to be determined by trial and error.

 

 

 

 

 

 

 

References

Edlinger & Guger (2002). EEG data processing and classification with g.BSanalyze under MATLAB. MATLAB Digest, 10(6). www.mathworks.com

 

Delorme, Makeig, & Sejnowski (?). Automatic artifact rejection for EEG data using higher-order statistics and independent component analysis. http://ica2001.ucsd.edu/index_files/pdfs/117-delorme.pdf

 

Maurer & Dierks (1991). Atlas of Brain Mapping. Springer-Verlag, New York.

 

Dolfus (1986). Chapter 8 in Duffy (Ed.), Topographic Mapping of Brain Activity. Butterworth, Boston.

 

Wolpaw, J. R., et al. (2000). IEEE Transactions on Rehabilitation Engineering, 8, 164.

 

Blankertz, B., Curio, G., & Müller, K.-R. (2002). Classifying single trial EEG: Towards brain computer interfacing. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (NIPS 01).

 

Murphy, K. (1998). HMM Toolbox for MATLAB. http://www.ai.mit.edu/~murphyk/Software/HMM/hmm.html

 

Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2), February 1989.