Blind Source Separation from Audio File

Introduction

This project consisted in trying to recover individual components from audio files that have been linearly mixed into a single stereo file (g: 2xN). This is an underdetermined problem, where uniqueness and stability of the solution are issues to be considered. The forward model for instantaneous mixture can be stated mathematically as:

Where A is a 2x3, f is 3xN (for this project we used 3 sources), and g is 2xN. Notice that since A has more columns than rows this is an underdetermined problem, so uniqueness is an issue. The mixing matrix A is also unknown, so if we were to use the min-norm solution for the undetermined model an estimate of A would be required. In this project we studied the following approaches for reconstruction:

· Min-norm Solution with modified mixing matrix A

· Estimating mixing matrix A from sparse signal decomposition

The performance of each approach was measured by using the reconstruction error: e = ||f – fhat ||.

Creating the Data

The audio data (g) used in this project was created by linearly mixing 3 sound sources with the following matrix:

A= 0.5 0.4 0.55 g= A*[f1;f2;f3] fi : 1xN, g: 2xN

0.5 0.6 0.45

The first source was equally panned (same volume on both channels of the data) and consisted of the vocal recording of a male singer. The second source was slightly panned to the left side (second channel), and consisted of keyboard strokes. The last source was panned to the right side (first channel, opposite to the keyboard mixture), and it was mainly drum (percussion) sounds. The number of samples (N) for each source was 1,018,335, yielding 23 seconds of audio (sampled at a sampling frequency of Fs= 44100 Hz). The signal to noise ratio was high (>70 dB), allowing us to assume that noise was negligible for our purposes. The coefficients of the mixing matrix had positive values between 0 and 1, representing the percent of panning (gain) to the respective channel on the output. The outputs and the input signals were all normalized and were between -1 and 1.

Min-norm Solution with modified mixing matrix A

In this approach we tried to recover the source by manipulating the forward model so we could take advantage of the number of samples collected and forcing the mixing matrix to be square and thus invertible. For example, if 3 samples of data are available, than we can write the forward problem as:

Equation 1

Equation 2

In this case the mixing matrix (the first matrix on the right side of equation 1, will be invertible if the sources are not set to zero. This will the allow us to estimate the coefficients of our original A matrix (the vector on the right hand side of the equation above), note however that in order to do so requires an estimate of the sources f (equation 2). Once a reliable estimate of A is obtained, the sources can be estimated using the min-norm solution:

Equation 3

Since our estimation of the sources and of the mixing matrix required an initial guess of the original parameters (equations 2 and 3), a study of the sensitivity of the solutions with respect to perturbations of an ideal guess was performed. For equation 2 we observed how the reconstruction error varied as more noise was added to an ideal guess of the sources fh (ie: fh= f + n*a). For equation 3, we observed how the reconstruction error varied as more noise was added to an ideal guess of the mixing matrix, Ah (ie: Ah= A + n*a). Studying the sensitivity on the solution allow us to gauge how good an initial estimate of the parameters has to be in order for the solution to be accurate.

Figure 1- Reconstruction error as a function of the norm of (f- fo). For this plot an initial estimate of fh (fo) was obtained from the original sources plus varying noise levels: fho= f + n. The Ah matrix was then estimated using equation 2. Reconstruction was then done using equation 3 and the error was estimated from the second norm of the original minus estimated sources. As the distance between A and Ah grows, the error increases almost exponentially. The reconstruction is sensitive to how close we initialize our parameters.

Figure 1 shows the reconstruction error as a function of how much our initial guess diverged from the original sources (by adding gaussian noise). The plot is an average over 50 simulations. Ah was estimated from the perturbed data using equation 2, and the estimate of the sources was then using equation 3. To calculate Ah, the data was divided in blocks of 3 samples, and a running mean of Ah was done over the entire data set. Whenever Ah had coefficients bigger than 1, its value was removed from the running mean (notice that for this simulation we allowed negative coefficients and coefficients bigger than 0.5 but less than1, not using fully our priori knowledge). The error was measured as the second norm of the f-fh. The results are the average of 50 simulations. From figure 1 we can assume that we will obtain a good reconstruction of the sources only if our initial estimate is very close to the true source f. Figure 2 shows the sensitivity of the solution (equation 3) to perturbations on the Ah estimates.

Figure 2- Reconstruction error vs amount of signal to noise in Ah (dB). The source estimation was done using equation 3. The results are the average of 50 simulations.

At around 40 dB SNR the reconstruction of the sources stabilizes at a reconstruction error of 45 (at this level each source file is still a mixture of the original sources, however each of the independent sources is louder at their respective row). This bias seems to be related to the min-norm solution, the source characteristics and the length of our input signal. In any case, a reconstruction error of 45 represents to us the floor limit on what we should except of the reconstruction using the min-norm solution and a very accurate estimate of A for this problem.

A major problem with this approach is how to initialize the parameters. In order for this procedure to be of any practical use, a good estimate initial estimate of the sources or the mixing matrix must be done through the collected data. We decided to get an estimate of the source (in order to estimate the mixing matrix using equation 2) through the following procedure:

The estimate of the first two sources, fh1 and fh2, were obtained by whitening a matrix that consisted of the first channel and the difference between the two channels:

W=[g(:,1); g(:,1)-g(:,2)]

The difference of the two channels was used instead of the second channel because the difference between the channels did not include the component that was centered (the vocals). This might help on brining the uncorrelated components closer (in vector space) to the sources that are being panned. The last component (the centered component) was initialized by subtracting the projection of the two estimated components from the second channel:

fh3 = g(:,2) - <fh1,g(:,2)>*g(:,2) - <fh2,g(:,2)>*g(:,2)

The reasoning behind this was that if the first two components were close in vector space to the original sources, then a Gram-Schmidt orhthogonization on the second channel might give us the centered component or somewhere close to it.

Once the signals were initialized equation 2 was then used to estimate the mixing matrix, Ah, from the running average of the whole dataset (the dataset was broken into windows of 3 samples for the estimation). Whenever the coefficients of the mixing matrix Ah were negative or above 0.5, its values were not used on the average. The estimated mixing matrix obtained this way had the following values for our data:

Ah=[0.2432 0.2915 0.1099 ; 0.2717 0.2984 0.0751]

The values were significantly different from the original mixing matrix. However the estimated matrix has some similar structure. The middle column has values almost equal to each other (thus centering one of the sources). And the other two columns have values inverse to each other (one column pans a source to the right and the other pans the source to left). The reconstruction error for this case was e= 215.25. Although the reconstruction error is pretty high and all the estimated sources are a mixture of the original sources, the sound created from the estimated sources have a unique source appear louder than the than the rest. Possible ways to improve this procedure would be to use more of the priori knowledge on the data. Perhaps initializing the data so that the elements are uncorrelated might not be the best criteria as well (in music it might well be that their amplitude for example, are correlated). A gradient decent approach might not be a good solution here too. Since, if we try to estimate fh (by making Ah a function of fh), we might run into memory and computational problems as N increases. On the other hand, if we try a gradient descent on Ah (by making fh a function of Ah), the solution may never converge since it appears that only very accurate estimates of A yield reasonable results. The cost function may contained several local minima.

Estimating mixing matrix A from sparse signal decomposition

A possible way to estimate the mixing matrix A from the data is by using sparse signal decomposition. The idea behind this method is that if the original sources are sparse enough, then a scatter plot of the data will reveal information of the A matrix (the ratio of gains for each source). This is because, since the input signals are sparse, the chances of two signals being greater than zero at the same time is very unlikely. More information on this method can be found at Independent Component Analysis, Principles and Practices (Roberts and Everson, 2001).

In order to make our data sparse, a Short Time Fourier Transform (STFT) was performed on the data using the SPECGRAM function from MATLAB. The window size was ser to 4096 samples (~ 10 ms), with the original sampling frequency of the signal and default values for the function. A scatter plot of our transformed data was then performed using the absolute values of the STFT, Figure 3. Points in the transformed data vector that did not sum to 10 were truncated off (this helped reduce the overlap in near the origin). From the scatter plot of the data it is clear that there are three distinct orientations (lines). The slopes of these lines are related to the ratio of the gains of the first and second column of the mixing matrix A, ie: .

Figure 3- Scattered plot of the magnitude of the STFT of the two channels from the data. Notice that three lines are distinct on the plot.

In order to estimate the slope of the lines, the MATLAB function GINPUT was used in the scatter plot to obtain the data for each orientation (Figures 4). The mean of ratio of the magnitudes was then used as an estimate of the slopes (the red lines on Figures 4 are the lines obtained with the estimated slopes).

Figure 4- Slope estimates from the cluster s in Figure 3. The data was collected from the scatter plot using the function GINPUT. Slopes were estimated as the mean value of |G1|/|G2|.

The slope values obtained were m= [0.8246 0.9878 1.5156]. Using prior knowledge of the mixing matrix A (its values were positive and the sum of the rows were equal to 1, the estimation of the mixing matrix, Ah was done in the following way:

m= a1/a2 a1 + a2 =1 -> a1= 1/ (1 + 1/m) a2= 1- a1

So that our data give the following estimate of Ah:

a1=1./(1+(1./m)=[ 0.4519 0.4969 0.6025]

Ah=[a1;(1-a1)]= [ 0.4519 0.4969 0.6025 ; 0.5481 0.5031 0.3975]

We can improve our estimate further by using priori knowledge and noting that estimates closest to 0.5 should correspond to the component that is centered in the audio file, so we can set it equal to 0.5. This will give us an estimate the mixing matrix up to some random permutations of the columns and rows. If we search for the permutation that minimizes the observation error (e= |g- Ah*fh), using equation 3 to estimate fh, we will see that the best permutation is:

Ah*=[0.5 Ah(2,3) Ah(2,1);0.5 Ah(1,3) Ah(1,1)]

=[0.5000 0.3975 0.5481

0.5000 0.6025 0.4519]

This gives a reconstruction error on Ah of e=norm (A-Ah) = 0.0044. And a reconstruction error on f of:

e = norm (f-fh) = 47.78

Which is close to the 45 error floor that we have measured using the Min-norm Solution with modified mixing matrix A in part 2.

Conclusion

Overall there appears to be a limit on how well we can reconstruct the data for this problem using the min-norm solution. This limit may be decreased by adding further constraints to the solution (ie: smoothness). An accurate guess of one of the two parameters (either fh or Ah) is required to reach this floor. The reconstruction is very sensitive to the range of these parameters. The sparse signal decomposition coupled with priori knowledge of the mixing matrix, proved to be a useful tool in helping give an accurate estimate of the A matrix and reconstructing the source signals close to the expected error floor of 45.