Cesium Sound

Gesture Synthesis Research and Development by Nick Longo

U.S. Patent 6,066,794
U.S. Patent 7,176,373

The Theory That Led to Gesture Synthesis

In 1991 I set out to write an article about some research I'd done in 1985, concerning the relationship of sound synthesis to music perception. I never finished the article because, while writing the conclusion, an idea occurred to me that I felt was important for the future development of synthesis technology. I elected to keep it secret for a while and look into the possibility of filing a patent application.

The theory is derived from an understanding of sound as a perceptual phenomenon rather than as acoustical vibrations. This may be understood by considering the classic question of whether a sound occurs if a tree falls in a forest and there is nobody there to hear it. According to the laws of physics, natural events that result in the perception of sound involve the propagation of energy through matter, via microscopic collisions between atoms and molecules. When you hear a tree fall in a forest, however, you don't hear every colliding air molecule. Instead you hear a sound you can identify as a single entity--something like a "woosh". Likewise, you don't hear every snapping molecular bond as the tree hits the ground. You hear a "crash". The sensation of sound is produced in your mind by perceptual processes that capture, encode, and reproduce only a tiny part of the energy released when the tree falls.

Sound perception may be compared to the more familiar processes of visual perception. When you enter a room you may tell at a glance that there is a table with a newspaper on it, for example, and a bookshelf with numerous volumes, including a collection such as an encyclopedia. You rely on internalized cognitive abilities, unlike an infant, who may crawl around the room for hours absorbing its contents. Using depth perception you know you have entered an enclosed space of a certain size. Using shape perception you recognize the table even though it is partly covered by a newspaper. The familiar look of newsprint triggers a pattern recognition response, and color perception helps you identify the uniform bindings of an encyclopedia.

When you hear music, you use aural perceptual modes that are comparable to those used for sight. For example, musical tone is similar to color. Like color, it can be analyzed in terms of component frequencies. The pitch and duration of notes may be compared to the spatial dimensions of height and length. These are the parameters traditionally used to write music in a visually perceivable form. Binaural hearing gives you a sense of a sound's location, similar to how parallax vision creates depth perception. Musical shape is perceived as "envelopes", and textural patterns as separable components such as breath, scrape, and "woosh".

If you are familiar with music synthesizers you may recognize that what I've just described are the parameters used to synthesize sounds. I first recognized this striking parallel in 1985. Then I realized that standard synthesis methods work not by modeling naturally occurring energy processes, but by mirroring the perceptual modes your ears and brain use to produce the sensation of sound in your mind. This creates the illusion of naturally occurring sound.

By extending this theory, stereo loudspeakers can be seen as a reflection of ears. That is, speakers turn electrical impulses into mechanical energy that propagates through the air, which is the opposite of the way your ears turn the energy of air molecules entering them into electrical impulses that are then sent to your brain. Likewise, stereo works in a way that mirrors binaural hearing. This is why quadraphonic sound never caught on. Four speakers may give a slight improvement for recreating earthquakes or other environmental effects, but you almost always "view" sound from the front. Two speakers actually do a pretty good job of creating the illusion of spatial location, because they work the way ears do, only in reverse. Electronic processors based on binaural perception enhance this effect. Extending the theory further, the light-emitting display on a synth panel works in the opposite way to how eyes work.

In this light, the construction of the piano keyboard is clearly a reflection of the human hand. Like fingers, the keys are arranged in a linear, side-by-side manner. Fingers push down in one direction and the keys push back. However, the keyboard is only a digital interface. Each key is just a switch, which can only be turned on to a single amount and turned off sometime later. Mathematically it is actually a base twelve digital interface, requiring lateral displacement of the hand's position at least every octave (although base ten might have been more comfortable). Traditionally, expression is achieved by creating pointillistic suggestions of a continuous gestural space.

What finally occurred to me is that what's missing from synthesizers is a reflection of muscles. Continuous muscular exertion is required to play a traditional musical instrument, and there is no analog for this in the synthesis architecture. One solution was proposed by my friend and computer music pioneer Scot Gresham-Lancaster. In 1984 Scot had done some force feedback experiments at The Exploratorium in San Francisco, using linear motors like the ones that move the heads in hard disk drives. He figured that to simulate interaction with a musical instrument, a motion detector and programmable circuitry would also be required. To emulate a guitar bend, for example, the motor would have to push back with an exponentially increasing amount of force, the way a guitar string does, and also respond to velocity and acceleration.

So I pulled a motor out of an old crashed hard disk and sat down to figure out the exact mathematical relation for a guitar string. But then I remembered from my introductory engineering class that metal wire is elastic, and that elasticity is linear. So if a guitar string or, say, a rubber band is linearly elastic, then why does it become exponentially harder to stretch the further you stretch it, until it is almost infinitely difficult? An infinitely resistive rubber band? That's absurd.

Then I had something of a revelation. If what seems like non-linear interaction isn't really non-linear, then it must be the perception of the interaction that is non-linear. That is, it must be that muscle activation is also a perceptual process and this sense of nonlinearity is a perceptual property of muscles. There is a finite limit to your muscles' strength, and as you reach your limit, you can exert an increasing amount of force only with greater and greater difficulty. It only seems like a rubber band has nearly infinite resistive power because it is difficult to exert enough force to actually break one. Then I realized that just like other perceptual modes, this nonlinear property of muscle activation could be modeled electronically. After that it took several years to map out the parameters of gesture modeling. But it was the counterintuitive nature of this problem that had kept the solution hidden.
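
As a rough illustration of this idea, and not the model described in the patents, the following sketch pairs a linearly elastic band (Hooke's law) with a hypothetical effort curve that rises steeply as the required force approaches a finite muscular limit. The function names, the stiffness and limit values, and the logarithmic form of the effort curve are all assumptions chosen only to make the point concrete.

```python
import math


def spring_force(stretch, stiffness=2.0):
    """Hooke's law: the band's restoring force grows linearly with stretch."""
    return stiffness * stretch


def perceived_effort(force, max_force=10.0):
    """Hypothetical effort curve (an assumption, not the patented model).

    Effort is roughly equal to force when the force is small, and grows
    without bound as the force approaches the muscle's finite limit.
    """
    if force < 0:
        raise ValueError("force must be non-negative")
    if force >= max_force:
        return math.inf
    return -max_force * math.log(1.0 - force / max_force)


if __name__ == "__main__":
    # The force column rises in even steps; the effort column does not.
    for stretch in (0.0, 1.0, 2.0, 3.0, 4.0, 4.9):
        force = spring_force(stretch)
        effort = perceived_effort(force)
        print(f"stretch={stretch:.1f}  force={force:.1f}  effort={effort:.2f}")
```

In this sketch the band itself stays linear throughout. Only the effort curve, which stands in for the muscle, is nonlinear.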

Another insight that proved valuable is that gestures are wavelike in nature. This is because muscles always operate in pairs. One muscle pulls in one direction, while the other controls the motion by pulling in the opposite direction. Then the roles are reversed. This system of opposing cyclic forces closely resembles the phenomenon known as simple harmonic motion, which causes sound wave vibrations. Breathing, walking, chewing, waving, and playing music are all performed with cyclic motions.

From Theory to Invention

By treating gestures as waveforms, I applied the principles of sound synthesis to create synthesized gestures. The difference is that gestures occur in the time domain rather than the frequency domain. These domains are mathematical constructs that describe physical phenomena. Without going into the math or the philosophy, they represent modes of time perception. In the time domain, wavelike phenomena give you a sense of horizontal time, like a drum beat that sounds like a train passing. In the frequency domain, waves become sustained tones that sound suspended in time, but can be stacked vertically, like harmonized voices that sound like layers of icing on a cake. Musical expression may be thought of as a third domain of time perception that gives the sense of emotional depth you associate with a live performance. This effect is achieved by creating a time horizon of possible outcomes of continuous musical trajectories. Performance gestures help create anticipation that is continuously resolved, denied, or modified.

There are three basic parameters to gestures. The time of a gesture is the period of the gesture wave, the distance or musical interval is the amplitude of the wave, and the shape is like the waveshape of a tone, which is related to tone quality or timbre. In particular, the characteristic shape of a gesture gives it a certain musical quality that is not only perceivable, but is a recognizable component of the sound of an instrument. And like the tone of an instrument, part of the quality of a gesture has to do with the way the waveshape changes in time, especially in relation to its musical context.
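
To make the three parameters concrete, here is a minimal sketch that samples one cycle of a hypothetical gesture wave. It assumes a raised-cosine base curve and uses a simple exponent as a stand-in for "shape". The function name, the argument names, and the choice of curve are illustrative assumptions, not the waveshapes used in the actual invention.

```python
import math


def gesture_wave(period, interval, shape=1.0, steps=8):
    """Sample one cycle of a hypothetical gesture wave.

    period:   time of the gesture in seconds (the period of the wave)
    interval: peak distance, e.g. a pitch bend in semitones (the amplitude)
    shape:    exponent that warps the curve (a stand-in for waveshape)
    Returns a list of (time, offset) pairs that rise from 0 to `interval`
    at mid-cycle and return to 0 at the end of the period.
    """
    if period <= 0 or steps < 1 or shape <= 0:
        raise ValueError("period, steps and shape must be positive")
    points = []
    for i in range(steps + 1):
        phase = i / steps
        # Raised cosine: 0 at the start, 1 at mid-cycle, 0 at the end.
        base = 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))
        points.append((phase * period, interval * base ** shape))
    return points


if __name__ == "__main__":
    # A half-second bend of two semitones, sampled with two different shapes.
    for shape in (1.0, 2.0):
        print(f"shape={shape}")
        for t, offset in gesture_wave(0.5, 2.0, shape=shape):
            print(f"  t={t:.3f}s  offset={offset:.3f}")
```

Changing `shape` leaves the period and the peak interval untouched but alters how the motion arrives at and leaves the peak, which is the sense in which shape is treated here as analogous to timbre.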

Using these parameters, and an underlying set of muscle activation models, I adapted the entire modular synthesis architecture to simulate instrumental interaction, with extremely satisfying results. Gesture Synthesis is triggered by motion of a control operator, such as a mod wheel. When the operator stops, the synthesis function may be turned off. This is just like triggering a synthesized note by depressing a key, then sending note-off by releasing the key. In between, the gesture synthesis function is modulated by movement of the operator itself. The speed at which you move the wheel, for example, and the way you speed up and slow down, work the way an envelope or continuous controller data works to modulate a synthesized tone. You can also use another operator such as a pedal or aftertouch to modulate the gesture synthesis function, or modulate the modulation.
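
The triggering scheme described above can be sketched, under stated assumptions, as a small function that turns timestamped mod wheel readings into events. The event names, the `still_threshold` parameter, and the use of raw speed as the modulation value are all hypothetical. The sketch only shows the on / modulate / off pattern, not how the gesture synthesis function itself is computed.

```python
def track_gesture(samples, still_threshold=0.0):
    """Turn (time, wheel_value) samples into gesture events.

    Emits ("gesture_on", t) when the wheel starts moving,
    ("modulate", t, speed) for each sample while it keeps moving, and
    ("gesture_off", t) when it stops. Speed is in wheel units per second.
    """
    events = []
    moving = False
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = abs(v1 - v0) / dt
        if speed > still_threshold:
            if not moving:
                events.append(("gesture_on", t1))
                moving = True
            events.append(("modulate", t1, speed))
        elif moving:
            events.append(("gesture_off", t1))
            moving = False
    return events


if __name__ == "__main__":
    # The wheel rests, moves slowly, moves faster, then stops.
    readings = [(0.0, 0), (0.5, 0), (1.0, 10), (1.5, 30), (2.0, 30)]
    for event in track_gesture(readings):
        print(event)
```

Motion plays the role that key-down plays for a note, stillness plays the role of key-up, and the speed values in between correspond to the continuous modulation described in the text.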

Further development of this technology involves various synthesis techniques based on mathematical analysis of time domain musical phenomena. I now have two U.S. Patents, and I also developed a prototype for Macintosh some years ago called Flex Processor 1.3 that was reviewed in Electronic Musician Magazine, and at http://www.electronicmusic.com.

copyright 1996, 1999 Nick Longo
