
December 22, 2007

timbreMap program testing

The timbreMap program is part of my PhD project and is designed to organize timbral features of its audio input in a 2D output space. It uses the JetNet implementation of artificial neural networks by Lönnblad et al., in particular the Kohonen feature map. The Kohonen net is a self-organizing (unsupervised training) feature map widely used in speech recognition. In the timbreMap program the network is fed a Bark-scale transform of the input. In the screenshots below the output, the winning node, is represented by the black dot in the center window. There is no preconceived mapping of input to output: although similar inputs will result in correspondingly similar outputs, the trained weights may differ between two training sets, causing different areas of the output map to respond to the same sound.

In the following screen capture we can observe the program while it attempts to organize its weights in response to three sine wave oscillators, crossfaded and tuned to three different frequencies. Thanks to the simplicity of the input the network organizes itself fairly quickly and optimizes its responses so that the winning node travels along the borders of the output map. Once the map is trained the network responds with the same output no matter the order or speed of its input.

[Flash video: screen capture of the training session]

In the next example the network has been trained on six different saxophone samples: two ordinarily played notes, two "growled" notes, and two multiphonics. What we see in the screen capture is the response of an already trained network. Though the output is noisier than in the previous example, there is a clear pattern to the responses. About halfway through I add a simple synthesizer with a pitch tracker (using Miller Puckette's fiddle object in PD). The synthesis algorithm is a simple implementation of Phase-Aligned Formant synthesis taken from the PD documentation (Chapter 3, F12). I then map the X axis of the network output to the formant center frequency of the synthesis, and the Y axis to the index parameter.
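The axis-to-parameter mapping might look like the following sketch. The post does not give the actual ranges used in the PD patch, so the frequency and index ranges below are guesses for illustration only.

```python
GRID = 10  # assumed size of the output map

def map_output(wx, wy, grid=GRID):
    """Map the winning node (wx, wy) to hypothetical PAF synthesis parameters."""
    # X axis -> formant centre frequency, assumed range 200..2000 Hz (linear)
    center_freq = 200.0 + (wx / (grid - 1)) * 1800.0
    # Y axis -> modulation index, assumed range 0..5
    index = (wy / (grid - 1)) * 5.0
    return center_freq, index
```

In the PD patch this would simply be two scaling objects between the network output and the synthesis inlets.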

[Flash video: screen capture of the trained network responding to the saxophone samples]

The mapping was done more or less arbitrarily, merely making sure the parameters would stay within reasonable ranges. Though the mapping is less successful on the multiphonics and the noisy growls, it makes perfect sense on the ordinary notes. In this case, letting properties of the input control aspects of the output within the same class of events seems to imply that the details of the mapping are less important. However, for the noisy input, what we perceive as one sound in the input (a growl or a multiphonic) becomes, in the synthesis, an oscillation between two different sounds. Here, more care in the mapping is needed, or a "smearing" of the data to counteract the "jumpiness" of the output.

Posted by henrikfr at December 22, 2007 11:00 PM

Comments

Henrik, this is impressive stuff! I can't believe how quickly the map 'converges'. Hopefully I'll have something along similar-ish lines to share in the new year.
BTW, I think low-pass filtering or moving average to smooth the output data would help a lot to reduce the jumpiness. Also you might want to try the "look before you leap" algorithm - i.e. don't trust jumps greater than a certain threshold unless the data stays beyond that threshold for k or more values.
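The two suggestions could be sketched like this; the window size, threshold, and k below are illustrative placeholders, not tested values.

```python
def moving_average(values, n=5):
    """Smooth the output coordinates with a simple moving average over the last n values."""
    out, window = [], []
    for v in values:
        window.append(v)
        if len(window) > n:
            window.pop(0)
        out.append(sum(window) / len(window))
    return out

def look_before_you_leap(values, threshold=2.0, k=3):
    """Ignore jumps larger than `threshold` unless the data stays
    beyond the threshold for k or more consecutive samples."""
    out = [values[0]]
    current = values[0]
    pending = 0  # consecutive samples that have stayed beyond the threshold
    for v in values[1:]:
        if abs(v - current) <= threshold:
            current = v
            pending = 0
        else:
            pending += 1
            if pending >= k:  # the jump persisted: accept the new level
                current = v
                pending = 0
        out.append(current)
    return out
```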

Posted by: Jamie Bullock [TypeKey Profile Page] at December 23, 2007 05:31 PM

Perhaps it's not clearly enough stated, but it is only in the first clip that the training starts from scratch. In the second clip the weights are pre-trained. Exactly what the training time should be for complex audio signals I'm not sure, but sensible output is achieved in about the same time frame as may be observed in the first clip. After that the network becomes more stable and reliable.

Lowpass filtering is definitely one way to go. "Look before you leap" I hadn't thought of; that would be a good thing to test. Also, the FFT size here is 1024 for the saxophone and 512 for the sine waves. I'm thinking I should try large FFT sizes and small hop sizes (something like 8192/512) to capture subtle changes due to quick timbral shifts in the input without losing time resolution. What would be really nice would be a feedback of information: let the properties of the output set the parameters of the analysis.
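The large-FFT/small-hop analysis could be sketched as a plain numpy STFT; the Hann window and the 8192/512 sizes here simply follow the figures mentioned above, and are not the program's actual analysis code.

```python
import numpy as np

def stft_frames(signal, fft_size=8192, hop=512):
    """Yield magnitude spectra of overlapping, Hann-windowed frames.
    A large fft_size gives fine frequency resolution; a small hop keeps
    the frame rate (and thus the time resolution of the map input) high."""
    window = np.hanning(fft_size)
    for start in range(0, len(signal) - fft_size + 1, hop):
        frame = signal[start:start + fft_size] * window
        yield np.abs(np.fft.rfft(frame))
```

Each magnitude spectrum would then be reduced to Bark bands before being fed to the network.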

Posted by: henrikfr [TypeKey Profile Page] at December 23, 2007 11:26 PM
