

The Design

etherSound is an attempt to open a musical work to the uninitiated and to provide for a notion of `democracy of participation': all contributions are equally valuable. Accessibility without prior knowledge of music or musical training is an end in itself in this project. It should be noted that this obviously presupposes that the participant knows how to send an SMS, and that the system makes it difficult for those who are not familiar with this technology1. It should also be made clear that using SMS text messages for interaction, as implemented here, does not allow for direct dynamic control. Every message generates one `message-composition', and all control data is derived from the content of the message.

Figure 1: Communication in the first version.

Communication - first model

In the first version, realized in August 2003, the communication between the participant and the system was accomplished according to Figure 1. An SMS sent to the specified number was transformed into an XML file and transferred to a URL by an HTTP POST request. This part was handled by an external service. At the called URL, a JSP (Java Server Pages) page directed the POST data to a Java Bean [J2EE 1.4.1, 2004] that handled the parsing of the data and the connection to a MySQL database, in which it created a new entry with the relevant fields.
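
As an illustration, the database step described above could be sketched in Java roughly as follows. The table name, column names and method signature are assumptions made for this sketch, not the actual schema used in the piece, and a MySQL JDBC driver is assumed to be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    /** Hypothetical sketch of the bean's database step: storing one parsed SMS. */
    public class SmsEntryBean {
        // Connection details and schema are illustrative assumptions.
        private static final String DB_URL = "jdbc:mysql://localhost/ethersound";

        public void store(String sender, String text, long receivedAt) throws Exception {
            try (Connection con = DriverManager.getConnection(DB_URL, "user", "password");
                 PreparedStatement ps = con.prepareStatement(
                         "INSERT INTO messages (sender, body, received_at) VALUES (?, ?, ?)")) {
                ps.setString(1, sender);
                ps.setString(2, text);
                ps.setLong(3, receivedAt);
                ps.executeUpdate();
            }
        }
    }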

Due to security restrictions at the museum where this version was realized, the HTTP request could not be handled locally. Instead, the local computer queried the server database for new entries at regular intervals. After some testing, sending an SQL query once every second seemed like a reasonable interval. Shorter intervals did not yield a perceptibly quicker response time and, since the synthesis program was running on the same machine, I did not want to use more processing power and network activity than necessary for this task (see section 3 for further discussion). After the text message had been processed, control signals were sent via MIDI to the synthesis engine.
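
A minimal sketch of such a polling loop, assuming a simple `messages' table with an auto-incremented id column, could look like the following; the schema and the handleMessage hook are illustrative assumptions.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.Timer;
    import java.util.TimerTask;

    /** Sketch of the once-per-second polling of the server database. */
    public class MessagePoller {
        private long lastSeenId = 0;  // highest message id processed so far

        public void start(final Connection con) {
            new Timer(true).scheduleAtFixedRate(new TimerTask() {
                public void run() {
                    try {
                        PreparedStatement ps = con.prepareStatement(
                                "SELECT id, body FROM messages WHERE id > ? ORDER BY id");
                        ps.setLong(1, lastSeenId);
                        ResultSet rs = ps.executeQuery();
                        while (rs.next()) {
                            lastSeenId = rs.getLong("id");
                            handleMessage(rs.getString("body"));  // text analysis and mapping
                        }
                        rs.close();
                        ps.close();
                    } catch (SQLException e) {
                        e.printStackTrace();
                    }
                }
            }, 0, 1000);  // query once every second, as described above
        }

        protected void handleMessage(String body) { /* parse and map to control data */ }
    }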

Communication - current model

Although the first version worked well and was fairly stable, it required an external SMS processing service and a reliable local network connection. In order to make the piece more `portable' and independent, the message receiving part has been rebuilt. Using the gnokii API [gnokii, 1995] it is relatively easy and reliable to connect a GSM phone to a computer and gain access to the storage and status of the phone, which makes it possible to receive the SMS messages locally. To make it possible to review the transmission activity, the messages are, just as in the first model, written to a database; in other words, the client-server model is retained, but on one and the same machine. Furthermore, the MIDI connection between the control application and the synthesis engine has been replaced with Open Sound Control (OSC) [Wright et al., 2003,OSC, 1997] for speed, reliability and flexibility, using the library JavaOSC (see http://www.mat.ucsb.edu/~c.ramakr/illposed/javaosc.html).
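
For illustration, sending a single control value to the synthesis engine with JavaOSC could look roughly like the sketch below; the OSC address pattern, port number and argument layout are assumptions and not the actual protocol used in etherSound.

    import java.net.InetAddress;
    import com.illposed.osc.OSCMessage;
    import com.illposed.osc.OSCPortOut;

    /** Sketch: sending one control value over OSC with the JavaOSC library. */
    public class OscSender {
        public static void main(String[] args) throws Exception {
            OSCPortOut port = new OSCPortOut(InetAddress.getByName("127.0.0.1"), 7400);
            OSCMessage msg = new OSCMessage("/etherSound/lifeIndex");  // assumed address
            msg.addArgument(Float.valueOf(0.81032f));  // e.g. the local index of a message
            port.send(msg);
            port.close();
        }
    }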

The text analysis

The program handling the text processing and the mapping of text to control signals for the sound synthesis is written in Java [J2SE 1.4.2, 2004] and features a simple but useful GUI for control of, and feedback about, the status of the system. It is here, in the mapping between the text and the sound, that the compositional choices have been made. Three groups of parameters are extracted for every message. For the timing there are two parameters: a local `life' index shaping the rhythms and the length of the current message, and a global index that influences the current and subsequent `message-compositions'. The global index is a function of the local indices of the current and previous messages. The purpose of the local index is to make a simple semantic analysis of the message and thus discriminate between a set of random letters and real words. The participant should be `rewarded' for the effort of writing a message with substance, where `substance' is defined here as a short message with a credible average word length and a reasonable distribution of vowels within these words. The local index is calculated by looking at the average length of words and the average number of syllables per word and comparing these with constants:

$\displaystyle i_1=\frac{1}{(w(\frac{c}{w_c})-w_l)^{1/2}+1} \qquad i_2=\frac{1}{(w(\frac{s}{w_c})-s_l)^{1/2}+1}$ (2.1)

where $ c$ and $ s$ are the total numbers of characters and syllables, $ w_c$ is the number of words in the current message, and $ w_l$ and $ s_l$ are constants defining the `optimal' mean number of characters and syllables per word. $ w$ is a weight defined by

$\displaystyle w=\frac{1}{w_c-s_c+0.5}$ (2.2)

where $ s_c$ is the total number of words that contain vowels. Through $ w$, the index is decreased if the message contains words without vowels. The mean value of $ i_1$ and $ i_2$ is then multiplied by the arctangent of the number of words in relation to a third constant parameter, $ o_w$, delimiting the optimal number of words per message2, according to (2.3).

$\displaystyle lifeIndex = \frac{i_1+i_2}{2}{\arctan (\frac{w_c}{o_w})}$ (2.3)

If we set $ w_l$ to 4.5, $ s_l$ to 2.0 and $ o_w$ to 10, the result for four different messages can be seen in Table 1; the method distinguishes fairly well between nonsense and real words at a low computational cost. Similar or better results could conceivably be achieved in a number of different ways, but this method appears to work well for the purpose. Since there is only audio feedback, it is important that all messages, even empty ones, lead to a perceptible change in the sonic output.

Table 1: Life index for four different messages
message life index
hello 0.18882
Hello, my name is Henrik 0.81032
hjdks la s duyfke jhsldf hasdfiw uehr jkdsl 0.14448
From fairest creatures we desire increase, That thereby beautys rose might never 1.44618
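
A sketch of the life index calculation, following (2.1)-(2.3), is given below. The tokenization and syllable-counting rules, and the use of the absolute value of the deviation before taking the square root, are assumptions made for this sketch, so it will not necessarily reproduce the exact values in Table 1.

    /** Sketch of the life index calculation (eqs. 2.1-2.3). */
    public class LifeIndex {
        static final double W_L = 4.5, S_L = 2.0, O_W = 10.0;  // the `optimal' constants

        public static double compute(String message) {
            String[] words = message.trim().split("\\s+");
            if (words.length == 0 || words[0].isEmpty()) return 0.0;
            int wc = words.length, chars = 0, syllables = 0, wordsWithVowels = 0;
            for (String word : words) {
                chars += word.length();
                syllables += countSyllables(word);
                if (word.matches(".*[aeiouyAEIOUY].*")) wordsWithVowels++;
            }
            double w = 1.0 / (wc - wordsWithVowels + 0.5);                         // eq. 2.2
            double i1 = 1.0 / (Math.sqrt(Math.abs(w * chars / wc - W_L)) + 1);     // eq. 2.1
            double i2 = 1.0 / (Math.sqrt(Math.abs(w * syllables / wc - S_L)) + 1);
            return (i1 + i2) / 2.0 * Math.atan(wc / O_W);                          // eq. 2.3
        }

        /** A vowel group or a punctuation mark counts as one syllable. */
        static int countSyllables(String word) {
            int n = 0;
            boolean inVowelGroup = false;
            for (char ch : word.toCharArray()) {
                boolean vowel = "aeiouyAEIOUY".indexOf(ch) >= 0;
                if (vowel && !inVowelGroup) n++;
                if (!Character.isLetterOrDigit(ch)) n++;  // punctuation
                inVowelGroup = vowel;
            }
            return n;
        }
    }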


The total length of the music derived from the message is calculated by multiplying a constant preset time with the local index. Any new message received adds its local index to the instantaneous global index, which constantly decreases exponentially at a set rate3. If a message causes the global index to reach its maximum, the playback of the current message is stopped and a precomposed pattern, sonically different from the output of a typical message, is played for about 30 seconds before the system resumes ordinary mode and plays back the message that caused the break. This feature was added to reward collaborative efforts. The global index mainly controls the density and the overall volume of the output, but also the distribution of random and stochastic processes in the synthesis.
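
The behaviour of the global index could be sketched as follows; the decay rate, the maximum value and the once-per-second update are assumptions made for the purpose of illustration.

    /** Sketch of the global index: exponential decay plus a threshold that
        triggers the precomposed pattern described above. */
    public class GlobalIndex {
        private static final double MAX = 1.0;               // assumed maximum
        private static final double DECAY_PER_SECOND = 0.99; // assumed decay rate
        private double value = 0.0;

        /** Called once per second by the control loop. */
        public void tick() { value *= DECAY_PER_SECOND; }

        /** Called when a new message arrives; returns true if the precomposed
            pattern should be played before the message itself. */
        public boolean addMessage(double localIndex) {
            value = Math.min(MAX, value + localIndex);
            return value >= MAX;
        }

        public double get() { return value; }
    }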

The synthesis

The synthesis engine is written as a Csound orchestra [Boulanger, 2000] (see also http://www.csounds.com/) running inside a Max/MSP (http://www.cycling74.com/products/maxmsp.html) patch through the use of the csound~ object (see http://www.csounds.com/matt/). The `score' for the message to be played back is sent to Max/MSP using OSC. Max/MSP is responsible for timing the note events and preparing valid information for the csound~ object and the orchestra file associated with it. Due to limitations in processing power, only one message can be played back at a time; if a message is received before the previously received message has finished playing back, the new message interrupts the current one.

Figure 2: Amplitude envelopes for Instrument A. Linear interpolation between these two envelopes is performed for every character between A and Z.

All sounds heard in etherSound are generated with FOF (Fonction d'Onde Formantique) synthesis as this technique is implemented in Csound [Clarke, 2000,Byrne Villez, 2000], using both samples and simple sine waves as sound sources. There are two distinct timbres, played by two different instruments, in each `message-composition': (A) granulated samples of a male voice reading a text in English4 and (B) a bell-like sound whose timbre is governed by the series of vowels in the text. The timbre as well as the generative rules of the first voice contrast with those of the second.

Instrument A

Every word of the message is considered one phrase or bar of music in the resulting message composition. The number of beats per bar is approximately equal to the number of syllables in the word, where a syllable is defined as a vowel, a group of consecutive vowels, or a punctuation mark. The rhythmic subdivision of each bar is equal to the number of characters, including punctuation and white space, in each syllable. Thus, a one-syllable word such as `my' followed by a white space results in a phrase consisting of one bar of one beat with two notes and one pause, i.e. three (eighth-note) triplets of which the last is silent (see Table 2). If a word ends with a full stop, a comma, an exclamation mark or a question mark, more emphasis is put on the end of the bar containing the punctuation mark, and the last note of the resulting phrase is elongated. A note close to a vowel is more likely to be accented than a note further away from a vowel.
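
A sketch of this rhythm rule is given below; how the characters are grouped into beats is not fully specified above, so that part is left out, and only the beat count and the note/rest subdivisions are derived. The main method reproduces the `my ' example: one beat and three subdivisions of which the last is silent.

    import java.util.ArrayList;
    import java.util.List;

    /** Sketch of Instrument A's rhythm rule: one bar per word, roughly one beat
        per syllable, one subdivision per character, white space becoming a rest. */
    public class InstrumentARhythm {

        /** A vowel group or a punctuation mark counts as one syllable (= one beat). */
        static int beats(String word) {
            int n = 0;
            boolean inVowelGroup = false;
            for (char ch : word.toCharArray()) {
                boolean vowel = "aeiouyAEIOUY".indexOf(ch) >= 0;
                if (vowel && !inVowelGroup) n++;
                if (!Character.isLetterOrDigit(ch) && !Character.isWhitespace(ch)) n++;
                inVowelGroup = vowel;
            }
            return Math.max(n, 1);
        }

        /** One event per character: true = sounding note, false = rest. */
        static List<Boolean> subdivisions(String wordWithTrailingSpace) {
            List<Boolean> events = new ArrayList<Boolean>();
            for (char ch : wordWithTrailingSpace.toCharArray()) {
                events.add(Boolean.valueOf(!Character.isWhitespace(ch)));
            }
            return events;
        }

        public static void main(String[] args) {
            // `my ' -> one beat, [note, note, rest]
            System.out.println(beats("my") + " beat(s): " + subdivisions("my "));
        }
    }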

The amplitude envelope curve of each note is related to the letter the note corresponds to. Envelopes are mapped linearly to characters; the letter `A' has a short attack and a long decay and the letter `Z' has a long attack and a short decay (see Figure 2). The amount of overlap between notes, i.e. the lengths of the notes, is influenced by the current life index and the global index, where higher values result in longer notes and thus in smoother transitions between timbres. The notes of Instrument A do not have a perceivable pitch. Twenty-eight short sample buffers (typically 32,768 samples or approximately 0.7 seconds), one for each letter, are mapped one to one to the characters in the message. The FOF synthesis is used to granulate these samples, creating an erratic, non-tonal texture that is nevertheless, in most cases, reminiscent of speech.
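
The mapping from character to envelope can be sketched as a linear interpolation between two end points; only the direction of the mapping (short attack for `A', long attack for `Z') is given above, so the actual time values below are assumptions.

    /** Sketch of the character-to-envelope mapping for Instrument A. */
    public class EnvelopeMap {
        // Assumed end points, in seconds.
        static final double ATTACK_A = 0.005, DECAY_A = 0.5;   // letter `A'
        static final double ATTACK_Z = 0.5,   DECAY_Z = 0.005; // letter `Z'

        /** Returns {attack, decay} for a letter between `A' and `Z'. */
        static double[] envelopeFor(char letter) {
            double t = (Character.toUpperCase(letter) - 'A') / 25.0;  // 0.0 at A, 1.0 at Z
            return new double[] {
                ATTACK_A + t * (ATTACK_Z - ATTACK_A),
                DECAY_A  + t * (DECAY_Z  - DECAY_A)
            };
        }
    }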

Figure 3: Rhythmic distribution of notes in Instrument B.

Instrument B

The phrasing of the notes of the second instrument is somewhat more complex than that of Instrument A. This instrument has, at the most5, as many voices as there are words in the message. If the polyphony of this instrument is limited to four voices, the rhythmic mapping of the notes for the message in Table 2 is shown in Figure 3. For this instrument the number of beats per bar (i.e. per word) is equal to the number of letters per word, including trailing punctuation marks and white space. If there are fewer words than the maximum polyphony, the number of voices is equal to the number of words; the first voice corresponds to the first word, the second voice to the second word and so forth. For every bar, each voice has as many potential excitations as there are letters in the corresponding word. After the initial excitation, which will always be played, the likelihood that a given note will be played is related to the life index and the global index: if the normalized sum of the local index and the global index is 0.5, half of the excitations will be performed. The amplitude envelope curve for the notes played by this instrument is either of a bell-like character or its inversion, and notes close to the beginning of a bar have a greater likelihood of being emphasized.
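
The excitation rule could be sketched as below: the first excitation of each bar always plays, while the remaining ones play with a probability given by the sum of the two indices; the exact normalization is an assumption.

    import java.util.Random;

    /** Sketch of Instrument B's excitation rule for one bar (one word). */
    public class InstrumentBExcitation {
        private final Random random = new Random();

        boolean[] excitations(int lettersInWord, double localIndex, double globalIndex) {
            double p = Math.min(1.0, (localIndex + globalIndex) / 2.0);  // assumed normalization
            boolean[] play = new boolean[lettersInWord];
            for (int i = 0; i < lettersInWord; i++) {
                play[i] = (i == 0) || random.nextDouble() < p;  // initial excitation always plays
            }
            return play;
        }
    }
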
Figure 4: Harmony for Instrument B as a result of the received message "Hello, my name is Henrik.".
The initial pitches are derived from the occurrence of certain key letters in the originating text6. The first unique occurrence of one of the key letters, searched for from the first letter of the word corresponding to the current voice to the end of the message, becomes the initial pitch of that voice. If none is found, the voice is deleted. The voicing of the initial chord is constructed so that the first voice is the top note of the chord and consecutive voices are laid out below it using octave transposition, aiming for the closest possible voicing.

The exact microtonal centre pitch between the highest and lowest notes of the initial chord is then calculated (this would be the pitch `D' if the initial chord is a major third up from `C'). After the initial chord has been introduced, all voices begin a virtual glissando towards the centre between the outer limits of the chord, creating microtonal variations of an ever-decreasing harmony, ending at a unison. For each excitation of each voice, the instantaneous value of the corresponding glissando sets the pitch of that excitation. Figure 4 shows the initial chord and the glissandi towards the centre that the message from Table 2 would result in, if the maximum polyphony value is set to five or higher and the `key' characters are mapped according to German note names (a to A, b to Bb, c to C, ..., h to B and s to Eb).
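
The pitch handling could be sketched as follows; the German note-name mapping is the one given above, while the octave placement, the MIDI-style pitch representation and the glissando timing are assumptions.

    /** Sketch of Instrument B's pitch handling: key letters to pitch classes and
        a linear glissando towards the microtonal centre of the initial chord. */
    public class InstrumentBPitch {

        /** a->A, b->Bb, c->C, d->D, e->E, f->F, g->G, h->B, s->Eb; -1 if not a key letter. */
        static double pitchClass(char key) {
            switch (Character.toLowerCase(key)) {
                case 'a': return 9;  case 'b': return 10; case 'c': return 0;
                case 'd': return 2;  case 'e': return 4;  case 'f': return 5;
                case 'g': return 7;  case 'h': return 11; case 's': return 3;
                default:  return -1;
            }
        }

        /** Instantaneous pitch of a voice gliding from its initial pitch to the centre. */
        static double glissando(double initialPitch, double centrePitch, double progress) {
            return initialPitch + (centrePitch - initialPitch) * progress;  // progress in [0,1]
        }

        public static void main(String[] args) {
            double low = 60, high = 64;          // e.g. C and E, a major third (MIDI numbers)
            double centre = (low + high) / 2.0;  // the microtonal centre, here D
            System.out.println(glissando(high, centre, 0.5));  // halfway through the glissando
        }
    }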

The timbre of the voices played by this instrument is also shaped by the vowels contained in the message and the order in which they appear. For non-real-time processing this is achieved by synthesizing the first five formants of the first vowel found in the word corresponding to the current voice and then interpolating between the formant spectra of the remaining vowels of the message (see Table 3). As this method is very expensive - it requires the allocation of five times as many voices - a cheaper variation has been implemented for real-time usage. By modulating the formant frequency of a single FOF voice with frequency modulation, whose carrier signal and index are derived from the vowel interpolation described above, the effect of moulding the formant spectrum with data about the content of the message is retained. However, it should be made clear that the sonic output of these two models is very different.
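
The non-real-time variant amounts to a linear interpolation between sets of formant frequencies, which could be sketched as below; the formant values are approximate published vowel formants, not the data used in the piece.

    /** Sketch of interpolating between the first five formants of two vowels. */
    public class FormantInterpolation {
        // Approximate formant frequencies in Hz for two vowels (illustrative values).
        static final double[] VOWEL_A = { 650, 1080, 2650, 2900, 3250 };
        static final double[] VOWEL_I = { 290, 1870, 2800, 3250, 3540 };

        /** Linear interpolation between two formant sets, t in [0,1]. */
        static double[] interpolate(double[] from, double[] to, double t) {
            double[] out = new double[from.length];
            for (int i = 0; i < from.length; i++) {
                out[i] = from[i] + t * (to[i] - from[i]);
            }
            return out;
        }
    }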

Sound event generation and synthesis - conclusion

The two instruments offer two different interpretations of the message, played back in parallel. While Instrument A performs a linear traversal of the message, Instrument B gives a snapshot image of the entire text at once, an image which gradually dissolves over time. One instrument models the discrete words and characters as they appear in time, the objective flow of the components of the message, while the other deals with the continuous meaning, or subjective understanding, of the message as it is understood in its entirety. Although the result can be rather complex and abstract, it is my intention that certain conceptual elements of the input should be retained in the output.

